Now, let's summarise what you learnt in this session. In the next video, our SME will briefly recap the topics related to optimising Network IO that were covered in this session.
In the first segment, you learnt what Network IO actually is and how Shuffles affect Network IO. You also learnt about the concept of data locality and understood how it can help in reducing Network IO. Then, you were briefly introduced to some of the techniques that can help in reducing Network IO.
Then, in the following segment, you learnt about the concept of Shuffles in detail. You also looked at the different operations that cause shuffling of data as well as some techniques that can be used to reduce shuffling. You also learnt about the difference between using groupByKey() and reduceByKey() operations by implementing both in the same Spark job.
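To recall why reduceByKey() usually moves less data than groupByKey(), here is a minimal plain-Python sketch. It does not use the Spark API: the function names `group_by_key_sum` and `reduce_by_key_sum` and the sample data are hypothetical, and the "shuffle" is only simulated by counting the records that would leave each partition.

```python
from collections import defaultdict

def group_by_key_sum(partitions):
    """groupByKey-style: every record crosses the shuffle, then values are summed."""
    shuffled = [record for part in partitions for record in part]
    grouped = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    totals = {key: sum(values) for key, values in grouped.items()}
    return totals, len(shuffled)

def reduce_by_key_sum(partitions):
    """reduceByKey-style: values are combined inside each partition before the shuffle."""
    shuffled = []
    for part in partitions:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value          # map-side combine
        shuffled.extend(local.items())   # one record per key per partition
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)

# Hypothetical input: two partitions of (word, 1) pairs.
partitions = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 1), ("b", 1), ("b", 1), ("c", 1)],
]

grouped_totals, grouped_shuffled = group_by_key_sum(partitions)
reduced_totals, reduced_shuffled = reduce_by_key_sum(partitions)
print(grouped_totals, grouped_shuffled)
print(reduced_totals, reduced_shuffled)
```

Both approaches give the same totals, but the reduceByKey-style version sends fewer records across the simulated shuffle because it combines values within each partition first.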
In the next segment, you learnt how joins affect Network IO and how you can optimise them. You also looked at the join types available in Spark, including Shuffled Hash Join and Broadcast Join, and understood how Broadcast Join improves upon Shuffled Hash Join. Then, you learnt how to implement both of these joins in a Spark job.
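The difference between the two join strategies can be sketched in plain Python as follows. This is only a simulation under simple assumptions, not Spark's implementation: the function names, the sample tables and the `records_sent` counter are hypothetical, keys are integers, the small table has unique keys, and every shuffled record is counted as sent over the network (an upper bound).

```python
def broadcast_hash_join(large_partitions, small_table):
    """Copy the small table to every partition; the large table never moves."""
    lookup = dict(small_table)
    joined = []
    for part in large_partitions:
        for key, value in part:
            if key in lookup:
                joined.append((key, value, lookup[key]))
    records_sent = len(small_table) * len(large_partitions)
    return joined, records_sent

def shuffled_hash_join(large_partitions, small_table, num_partitions):
    """Repartition both sides by key, then build and probe a hash table per bucket."""
    large_buckets = [[] for _ in range(num_partitions)]
    small_buckets = [[] for _ in range(num_partitions)]
    records_sent = 0
    for part in large_partitions:
        for key, value in part:
            large_buckets[key % num_partitions].append((key, value))
            records_sent += 1
    for key, value in small_table:
        small_buckets[key % num_partitions].append((key, value))
        records_sent += 1
    joined = []
    for large_bucket, small_bucket in zip(large_buckets, small_buckets):
        lookup = dict(small_bucket)
        for key, value in large_bucket:
            if key in lookup:
                joined.append((key, value, lookup[key]))
    return joined, records_sent

# Hypothetical data: (customer_id, item) orders and a small (customer_id, country) table.
orders = [
    [(1, "pen"), (2, "ink"), (1, "pad"), (3, "cap")],
    [(2, "nib"), (4, "box"), (3, "tag"), (1, "bag")],
]
countries = [(1, "IN"), (2, "US"), (3, "UK")]

broadcast_rows, broadcast_sent = broadcast_hash_join(orders, countries)
shuffled_rows, shuffled_sent = shuffled_hash_join(orders, countries, 2)
print(broadcast_sent, shuffled_sent)
```

Both joins return the same rows. In this sketch the broadcast version sends only copies of the small table, so its advantage grows as the large table grows and shrinks as the number of partitions or the size of the small table increases.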
In the next segment, you learnt about the concept of data partitioning and understood how appropriate partitioning can help in reducing Network IO. You also looked at the operations that benefit from good partitioning as well as the operations that change the partitioning itself. Then, you learnt how to implement custom partitioners in a Spark job and how to choose the right partitioning criteria for them. Finally, you learnt how to use functions such as partitionBy(), repartition() and coalesce().
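As a rough illustration of these ideas, the plain-Python sketch below places records with a custom partitioning rule and then reduces the number of partitions by merging whole partitions. It is not the Spark API: `region_partitioner`, `partition_by`, `coalesce` and the sample keys are hypothetical names chosen for this sketch, and the merge logic is a simplification of what coalesce() does.

```python
import zlib

def region_partitioner(key, num_partitions):
    """Hypothetical custom rule: keys such as 'in-101' are placed by their region prefix."""
    prefix = key.split("-")[0]
    return zlib.crc32(prefix.encode("utf-8")) % num_partitions

def partition_by(records, num_partitions, partitioner):
    """Place each (key, value) record in the partition chosen by the partitioner."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[partitioner(key, num_partitions)].append((key, value))
    return partitions

def coalesce(partitions, num_partitions):
    """Reduce the partition count by merging whole partitions, without re-hashing records."""
    merged = [[] for _ in range(num_partitions)]
    for index, part in enumerate(partitions):
        merged[index % num_partitions].extend(part)
    return merged

def partitions_holding(partitions, prefix):
    """Return the indices of the partitions that contain keys with the given prefix."""
    return {
        index
        for index, part in enumerate(partitions)
        for key, _ in part
        if key.split("-")[0] == prefix
    }

# Hypothetical records keyed by '<region>-<id>'.
records = [("in-101", 5), ("us-202", 7), ("in-103", 2), ("uk-301", 9), ("us-204", 1)]

by_region = partition_by(records, 4, region_partitioner)
coalesced = coalesce(by_region, 2)
print([len(part) for part in by_region], [len(part) for part in coalesced])
```

Because the custom rule keeps all records of a region in one partition, a later per-region aggregation would not need to move those records again. Merging whole partitions preserves that property, which is why reducing the partition count this way is cheaper than a full repartition.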
The PPT that was used throughout this session is provided below.