In the previous segment, you learnt how optimising Joins can help in reducing Shuffles and ultimately help in reducing Network IO in a Spark job.
In this segment, you will learn about another method that you can use to reduce Network IO in Spark jobs: Using optimised data partitioning techniques.
In the next video, our SME will discuss what partitioning is and how proper partitioning techniques can help in reducing Network IO. You will also learn about the various operations that benefit from partitioning, as well as the different operations that can affect partitions in a Spark job.
Note: At 1:51, the point that should appear is ‘…correct partitioner…’. The word ‘partitioner’ is misspelt as ‘practitioner’.
In the video provided above, you first looked at the process of data partitioning.
Data partitioning is the process of dividing data into multiple parts. In distributed systems, whenever we need to process one big file, we typically break it into smaller parts, where each part is called a partition. This logic applies to data processing in Spark as well.
As the number of cores, executor machines and partitions in a distributed system increases, the degree of parallelism increases, which directly improves the performance of a Spark job.
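To make this concrete, here is a minimal PySpark sketch (assuming a local 4-core session; the numbers are purely illustrative) that creates an RDD with an explicit number of partitions and inspects how the records are distributed:

```python
from pyspark.sql import SparkSession

# Assumption: a local 4-core Spark session used only for illustration.
spark = SparkSession.builder.master("local[4]").appName("partitioning-demo").getOrCreate()
sc = spark.sparkContext

# Split 100 records into 8 partitions; with 4 cores, up to 4 partitions
# are processed in parallel at any point in time.
numbers = sc.parallelize(range(1, 101), numSlices=8)

print(numbers.getNumPartitions())          # 8
print(numbers.glom().map(len).collect())   # records per partition, e.g. [12, 13, 12, ...]
```

Each partition becomes one task, so the partition count directly controls the degree of parallelism available to the job.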
As data is divided into partitions, proper partitioning can help in reducing Network IO as follows:
- If you apply the correct type of partitioner and also choose the optimal number of partitions, then shuffling is drastically reduced for wide transformations such as groupByKey and reduceByKey, as records with the same key are very likely to reside on the same machines (a sketch follows this list).
- You can also write your own custom partitioners, which further reduce the volume of data that needs to be shuffled.
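As an illustration of the first point, here is a hedged PySpark sketch (the key–value data is made up) in which a pair RDD is hash-partitioned once up front so that a subsequent reduceByKey() moves far less data:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").appName("partitioner-demo").getOrCreate()
sc = spark.sparkContext

# Hypothetical (key, value) data.
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("c", 4)] * 1000)

# Hash-partition the pair RDD once and cache it: records with the same key
# now sit in the same partition (and, on a cluster, on the same machine).
partitioned = pairs.partitionBy(8).cache()

# Because the data is already grouped by key across partitions, this wide
# transformation shuffles far less data; when the partitioner matches, the
# shuffle can be avoided entirely.
totals = partitioned.reduceByKey(lambda x, y: x + y, 8)
print(totals.collect())
```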
Moving further, the various operations that benefit from optimal partitioning techniques are as follows:
- Wide transformations such as groupByKey() and reduceByKey(): As discussed earlier, wide transformations such as groupByKey() benefit heavily from proper partitioning, as the shuffles they trigger are drastically reduced.
- Joins: Since records with the same key belong to the same partitions, the overall shuffle required to perform a Join is reduced (see the sketch after this list).
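To illustrate the Join case, here is a minimal sketch (with hypothetical orders/users data) that co-partitions both pair RDDs before joining them:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").appName("join-demo").getOrCreate()
sc = spark.sparkContext

# Hypothetical datasets keyed by user id.
orders = sc.parallelize([(1, "order_a"), (2, "order_b"), (3, "order_c")])
users = sc.parallelize([(1, "alice"), (2, "bob"), (3, "carol")])

# Partition both RDDs with the same partitioner and partition count, and
# cache them so the partitioning is reused.
orders_p = orders.partitionBy(4).cache()
users_p = users.partitionBy(4).cache()

# Matching keys now live in the same partitions, so the join needs far less
# (ideally no) network shuffle.
print(orders_p.join(users_p).collect())
```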
Now, let’s take a look at the following methods that can affect the partitions in a Spark job.
- Repartition: The repartition() method can be used to either increase or decrease the number of partitions in a Spark job. Whenever we call this method, the data undergoes a full shuffle across machines and is allocated to new partitions. Generally, we use this method only when we need to increase the number of partitions.
- Coalesce: The coalesce() method can be used to decrease the number of partitions in a Spark job. The difference between this method and repartition() is that coalesce() tries to minimise the movement of data across partitions, avoiding an unnecessary full shuffle.
Let’s understand these two operations in detail with the help of the example given below:
As you can see in the example provided above, when we reduce the number of partitions with the help of coalesce, we retain the original partitions, Partitions A and C, and only the data from Partitions B and D is redistributed among Partitions A and C. This ensures minimal movement of data across the partitions.
If you use repartition() in the same example, you will see that all the values in the original partitions are shuffled randomly and redistributed roughly equally among the two resultant partitions.
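The same behaviour can be observed in code. Below is a minimal sketch (local data, illustrative output) that contrasts coalesce() and repartition() on the same four-partition RDD, using glom() to inspect which values end up in which partition:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").appName("coalesce-vs-repartition").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1, 13), numSlices=4)   # 4 original partitions
print(rdd.glom().collect())                       # e.g. [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]

# coalesce(2) merges existing partitions, moving as little data as possible
# and avoiding a full shuffle.
print(rdd.coalesce(2).glom().collect())

# repartition(2) triggers a full shuffle and redistributes the values roughly
# evenly across the two new partitions.
print(rdd.repartition(2).glom().collect())
```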
As discussed earlier, Spark allows you to create your own partitioner through the custom partitioner mechanism, where you can adjust the size and number of partitions, or the partitioning scheme itself, according to the needs of your application.
In the next video, you will learn how to choose the right partitioning criteria while building your custom partitioner.
As you saw in this video, you should try to design a partitioner such that the data is distributed roughly equally among all the partitions. Also, the number of partitions should be greater than the number of physical cores present in the cluster to avoid underutilisation of the cluster resources.
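As a rough illustration of these guidelines, here is a hedged sketch of a custom partitioner in PySpark. partitionBy() accepts any Python function that maps a key to a partition index; the domain-based scheme and the sample data below are hypothetical and only meant to show the shape of the approach:

```python
from urllib.parse import urlparse
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").appName("custom-partitioner").getOrCreate()
sc = spark.sparkContext

# Keep more partitions than physical cores so that no core sits idle.
num_partitions = 3 * sc.defaultParallelism

def domain_partitioner(url):
    """Hypothetical scheme: send all pages of one website to the same partition."""
    # Note: set PYTHONHASHSEED so hash() is consistent across executors on a real cluster.
    return hash(urlparse(url).netloc) % num_partitions

logs = sc.parallelize([
    ("https://example.com/a", 1),
    ("https://example.com/b", 2),
    ("https://spark.apache.org/docs", 3),
])

partitioned = logs.partitionBy(num_partitions, domain_partitioner)

print(partitioned.getNumPartitions())
print(partitioned.glom().collect())   # inspect how the keys were distributed
```

With many distinct domains, a scheme like this keeps related records together while still spreading the load fairly evenly across the partitions.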
In the next video, you will learn how to implement various partitioning techniques and how to create a custom partitioner. You will also understand how to use functions such as partitionBy(), repartition() and coalesce().
Note: You may get different results when you run the Jupyter Notebooks. This may be due to changes in network bandwidth and other internal reasons.
In this session, you learnt what data partitioning is and how proper partitioning techniques can help in reducing Network IO. You also learnt about the various operations that benefit from optimal partitioning techniques, as well as some methods that can affect partitions themselves. Finally, you understood how to implement custom partitioners and how to choose the right partitioning criteria.
Additional Content
- Repartition vs Coalesce – Link to a Stack Overflow thread on the differences between repartition and coalesce
- Partitioning in Apache Spark – Link to a Medium article on partitioning in Spark