In the previous segment, you learnt about Shuffles as well as the techniques to reduce Shuffles in Spark jobs.
In the next video, our SME will discuss in detail how Joins affect Network IO and the different types of Joins available in Spark. You will also learn about two common Joins in Spark: the Shuffle Hash Join and the Broadcast Join.
In the video provided above, you learnt how Joins affect Network IO and looked at the different types of Joins in Spark.
A Join is typically an operation in which you retrieve data from two different tables that are related through one or more common columns. These tables may reside on two separate machines of a cluster, which naturally leads to the movement of data from one machine to another so that the Join operation can be carried out. This is how Joins lead to the shuffling of data.
At the industry level, especially in ETL applications, Joins are not only among the most common operations but also among the heaviest in terms of compute load.
Traditionally, a Shuffle Hash Join is used for joining two tables in Spark. It requires not only a lot of data to be shuffled but also hash tables to be built, which makes the Join process even more expensive. Let’s take a look at the image given below to see how a Shuffle Hash Join works.
As you can see in this image, we have two RDDs, and when we try to join them, we need to bring together all the matching records of RDD1 and RDD2, which requires a lot of data to be shuffled across the different machines.
A Shuffle Hash Join ensures that the data in each partition contains the same keys by partitioning the second table with the same partitioner as the first. Therefore, keys with the same hash value end up in the same partition. However, this process requires a lot of shuffling, as can be seen in the image provided above.
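The partitioning idea described above can be sketched in plain Python (this is an illustration of the mechanism, not Spark code; the table names and values are made up for the example). Both sides are hash-partitioned on the join key with the same partitioner, so matching keys land in the same partition, and each partition is then joined locally using a hash table built from one side.

```python
NUM_PARTITIONS = 4

def partition_by_key(rows, num_partitions):
    """Assign each (key, value) row to a partition via hash(key) % n."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in rows:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions

def shuffle_hash_join(left, right, num_partitions=NUM_PARTITIONS):
    """Join two lists of (key, value) pairs, partition by partition."""
    left_parts = partition_by_key(left, num_partitions)
    right_parts = partition_by_key(right, num_partitions)
    joined = []
    for lpart, rpart in zip(left_parts, right_parts):
        # Build a hash table from one side of this partition...
        table = {}
        for key, value in lpart:
            table.setdefault(key, []).append(value)
        # ...and probe it with the matching partition of the other side.
        for key, rvalue in rpart:
            for lvalue in table.get(key, []):
                joined.append((key, lvalue, rvalue))
    return joined

orders = [(1, "order-a"), (2, "order-b"), (3, "order-c")]
users = [(1, "alice"), (2, "bob")]
print(sorted(shuffle_hash_join(orders, users)))
# [(1, 'order-a', 'alice'), (2, 'order-b', 'bob')]
```

In a real cluster, the `partition_by_key` step is the expensive part: every row may have to travel over the network to reach its assigned partition.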
Spark supports almost all the Join types available in a typical RDBMS, which are as follows:
- Inner-Join
- Left-Join
- Right-Join
- Outer-Join
- Cross-Join
- Left-Semi-Join
- Left-Anti-Join
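The semantics of these Join types can be illustrated with ordinary Python dictionaries (the keys and values here are made up for the example; in Spark's DataFrame API, the corresponding join-type strings are "inner", "left", "right", "outer", "cross", "left_semi" and "left_anti"):

```python
left = {1: "a", 2: "b", 3: "c"}   # key -> value
right = {2: "x", 3: "y", 4: "z"}

# Inner-Join: only keys present on both sides.
inner = {k: (left[k], right[k]) for k in left if k in right}

# Left-Join: every left key, with None where the right side has no match.
left_outer = {k: (left[k], right.get(k)) for k in left}

# Left-Semi-Join: left rows that have a match (right columns are not kept).
left_semi = {k: left[k] for k in left if k in right}

# Left-Anti-Join: left rows that have no match on the right.
left_anti = {k: left[k] for k in left if k not in right}

print(inner)       # {2: ('b', 'x'), 3: ('c', 'y')}
print(left_outer)  # {1: ('a', None), 2: ('b', 'x'), 3: ('c', 'y')}
print(left_semi)   # {2: 'b', 3: 'c'}
print(left_anti)   # {1: 'a'}
```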
In situations where we have to join two tables or RDDs, and one of them is big while the other is relatively small, then, instead of performing a blind Join that will again result in a lot of Shuffles, we can broadcast the smaller table to all the partitions of the bigger table. The Join can then be performed without any shuffling, and the original parallelism of the bigger table or RDD is maintained.
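A minimal plain-Python sketch of this idea (again, an illustration rather than Spark code, with made-up table contents): the small table is turned into a hash map and a copy is handed to every partition of the big table, so each partition joins locally with no shuffling at all.

```python
# Partitions of the big table stay where they are on the cluster.
big_table_partitions = [
    [(1, "click"), (2, "view")],      # partition 0
    [(2, "click"), (3, "purchase")],  # partition 1
]

# The small side table fits in memory on every machine.
small_table = [(1, "alice"), (2, "bob"), (3, "carol")]
broadcast = dict(small_table)  # replicated to every partition

# Each partition joins independently against its local copy:
# no data from the big table moves across the network.
joined = [
    (key, event, broadcast[key])
    for partition in big_table_partitions
    for key, event in partition
    if key in broadcast
]
print(joined)
# [(1, 'click', 'alice'), (2, 'view', 'bob'), (2, 'click', 'bob'), (3, 'purchase', 'carol')]
```

The trade-off is memory: the broadcast table must be small enough to be copied to every executor, which is why this approach suits side tables such as metadata lookups.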
Typically, Broadcast Joins are extremely useful in cases where the main data table has to be combined with a small side table, for example, a metadata table. Let’s take a look at the image given below to understand what actually happens during a Broadcast Join.
In the next video, you will learn how a Broadcast Join can improve on the execution time of a Shuffle Hash Join by implementing both of them on the same data set in a Spark job.
The link to the Jupyter Notebook used in this segment is given below.
Note:
You may get different results when you run the Jupyter Notebooks. This may be due to changes in network bandwidth and other environment-specific factors.
In this session, you learnt how Joins affect Network IO and also looked at the different types of Joins available in Spark.
Additional Content
- Shuffle Hash Join vs Broadcast Join – Link to a GitHub page that discusses the difference between Shuffle Hash Join and Broadcast Join