In the previous segment, you learnt about Network IO and understood how it is caused by the phenomenon of Shuffles. In this segment, you will learn about Shuffles in detail and understand which operations in Spark cause Shuffles.
In the next video, our SME will discuss the concept of Shuffles in detail.
In this video, you learnt about the concept of Shuffles (Network Shuffles) in Spark.
As discussed earlier, a Shuffle in Spark refers to regrouping and redistributing data across various partitions and machines. A Shuffle happens whenever a wide operation occurs in Spark.
Some of the operations that cause Shuffles to happen are as follows:
- cogroup
- join
- groupByKey
- reduceByKey
- combineByKey
- sortByKey
- distinct
- intersection
- repartition
- coalesce (when used with shuffle = true)
As you can see, all of these operations involve the movement of data between partitions (worker nodes) in one of the steps of their execution. Data has to be moved across machines so that records can be matched by their respective keys in order to perform the required function, leading to a lot of Network IO.
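The idea of matching records by key can be sketched in plain Python (this is not Spark code; it is a simplified illustration of hash partitioning, which is the default strategy Spark uses to decide where each key's records go):

```python
# Sketch (plain Python, not Spark): how key-based operations decide which
# partition each record belongs to. Spark's default HashPartitioner works
# on the same principle: partition = hash(key) % numPartitions, so records
# that share a key always land in the same partition and can be matched there.

def assign_partition(key, num_partitions):
    """Mimics hash partitioning: the same key always maps to the same partition."""
    return hash(key) % num_partitions

records = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
num_partitions = 3

placement = {}
for key, value in records:
    placement.setdefault(assign_partition(key, num_partitions), []).append((key, value))

# All records with the same key end up in the same partition, which is why
# they must first be moved ("shuffled") across the network.
```

Because every record with a given key must reach the partition the hash function assigns it to, records scattered across machines have to travel over the network, and that movement is the Shuffle.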
The nature of distributed systems means that Shuffles are bound to happen. However, too many Shuffles will have an extremely adverse effect on the performance of your Spark jobs.
Let’s take a look at the following image to understand Shuffles better.
As you can see in the image provided above, this is a groupByKey operation followed by a sum of the values for each key. There are three worker nodes, each holding data for two keys, a and b. For this operation, you need to get all the key-value pairs corresponding to one key onto the same machine so that the sum can be computed. This movement of data is known as a Shuffle.
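The scenario in the image can be sketched in plain Python (not Spark code; the per-worker values below are illustrative, not taken from the image):

```python
from collections import defaultdict

# Sketch (plain Python, not Spark): three worker partitions hold (key, value)
# pairs for keys "a" and "b". A groupByKey-style operation must first move
# every pair for a given key to one place before the per-key sum can be
# computed. The values are illustrative.

partitions = [
    [("a", 1), ("b", 2)],   # worker 1
    [("a", 3), ("b", 4)],   # worker 2
    [("a", 5), ("b", 6)],   # worker 3
]

# Step 1 (the Shuffle): bring all values of each key together.
grouped = defaultdict(list)
for partition in partitions:
    for key, value in partition:
        grouped[key].append(value)

# Step 2: compute the per-key sum on the regrouped data.
sums = {key: sum(values) for key, values in grouped.items()}
print(sums)  # {'a': 9, 'b': 12}
```

Note that every single raw pair crosses partition boundaries here; the sum only happens after all the data has been moved.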
Now that you know about Shuffles, let’s take a look at the following ways to reduce them.
- Optimal partitioning: If you use optimal custom partitioning techniques, then a well-chosen set of partitioning keys will help you avoid a significant number of Shuffles, as similar data is more likely to be placed together in the same partitions on the same nodes.
- Broadcast Joins: Joins are among the most expensive operations in terms of Network IO and, at the same time, extremely heavy in terms of Shuffle operations. One way to reduce Shuffles is that, in cases where one of the tables is much smaller than the other, you can broadcast the smaller table to all the partitions containing the larger table, so that the join is performed locally and the larger table does not need to be shuffled at all.
- Using reduceByKey() over groupByKey(): In cases where you need to use wide transformations, using more optimised operations such as reduceByKey() instead of groupByKey() will help in reducing shuffling to a great extent, as reduceByKey() combines values locally on each partition before shuffling and therefore results in much lower Network IO overhead.
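The broadcast-join idea can be sketched in plain Python (not Spark code; the table contents are illustrative):

```python
# Sketch (plain Python, not Spark) of a broadcast join: when one table is
# small, a copy of it is sent to every partition of the large table, so each
# partition can join locally and no rows of the large table cross the network.

small_table = {"a": "apple", "b": "banana"}   # broadcast to all workers

large_table_partitions = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],   # "c" has no match in the small table
]

joined = []
for partition in large_table_partitions:      # each worker joins locally
    for key, value in partition:
        if key in small_table:                # inner-join semantics
            joined.append((key, value, small_table[key]))

print(joined)  # [('a', 1, 'apple'), ('b', 2, 'banana'), ('a', 3, 'apple')]
```

Because the lookup table travels once to each worker instead of the large table's rows being regrouped by key, the Shuffle of the large table is avoided entirely.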
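The difference between groupByKey() and reduceByKey() can also be sketched in plain Python (not Spark code; the pairs below are illustrative) by counting how many records would cross the network in each case:

```python
from collections import Counter, defaultdict

# Sketch (plain Python, not Spark) of why reduceByKey() shuffles less than
# groupByKey(): reduceByKey combines values locally on each partition first
# (a "map-side combine"), so only one record per key per partition crosses
# the network, while groupByKey ships every raw pair.

partitions = [
    [("a", 1), ("a", 2), ("b", 3)],
    [("a", 4), ("b", 5), ("b", 6)],
]

# groupByKey-style: every raw pair is shuffled.
records_shuffled_group = sum(len(p) for p in partitions)      # 6 records

# reduceByKey-style: combine locally, then shuffle one pair per key per partition.
locally_combined = []
for partition in partitions:
    local = defaultdict(int)
    for key, value in partition:
        local[key] += value                   # map-side combine
    locally_combined.append(list(local.items()))

records_shuffled_reduce = sum(len(p) for p in locally_combined)  # 4 records

# Final merge after the (smaller) shuffle gives the same per-key sums.
totals = Counter()
for partition in locally_combined:
    totals.update(dict(partition))
print(dict(totals))  # {'a': 7, 'b': 14}
```

The saving grows with the amount of duplication per key on each partition: the more values a partition holds for a key, the more records the map-side combine collapses before the Shuffle.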
In the next video, our SME will explain how using operations such as reduceByKey() over groupByKey() can help in reducing Shuffles in Spark jobs.
The link to the Jupyter Notebook used in this segment is given below.
Note:
You may get different results when you run these Jupyter Notebooks. This may be due to changes in network bandwidth and other internal reasons.
In this session, you learnt about the concept of Shuffles and the operations that cause them. You also looked at various techniques to reduce Shuffles in Spark jobs and understood how using optimised wide operations such as reduceByKey() instead of groupByKey() can help in reducing Shuffles to a great extent.
Additional Content
- Shuffling in Apache Spark – Link to a Stack Overflow thread on when Shuffling occurs in Apache Spark
- Comparison between reduceByKey, groupByKey, aggregateByKey and combineByKey – Link to a Stack Overflow thread on the differences between the ByKey operations