In the previous session, you learnt about the concept of Disk IO along with the techniques that can help in optimising it. In this session, you will look at the concept of Network IO and understand how you can optimise it.
In the next video, you will learn what Network IO means and what are the operations behind it. You will also learn about the concept of data locality and understand why it is important to be considered when optimising Network IO along with some techniques to reduce Network IO.
In the video provided above, you learnt about the concept of Network IO.
Typically, Network IO can be attributed to the time that is taken when data has to be sent between monitored processes. This means that whenever the data present in, say, another remote machine or node is required for the current job, the Network IO operation will be performed to get it over the network. That is what Network IO is all about.
In Spark, Network IO occurs typically due to Shuffles. Spark is a distributed processing system; thus, data is distributed between a number of different machines and partitions. Shuffles refer to the phenomenon when data has to be rearranged between various partitions. Shuffle is a movement of data across multiple partitions (which will mostly be sitting on a different node). Now since you know that Shuffles is the cause behind the increase in Network IO, let’s look at the following ways to reduce it:
- Optimising joins: Shuffles usually happen whenever a join operation occurs. A join is an extremely expensive operation in data storage systems. In Spark, whenever a join occurs, larger sets of data such as entire tables are moved across various partitions; thus, optimising techniques for joins directly help in reducing Shuffles.
- Avoiding wide transformations: Many wide transformation operations such as groupByKey() and sorting involve the movement of a large amount of data across various partitions. If you can avoid such operations, then you can avoid Shuffles.
In any big data processing system, the overall volume of data is extremely big; thus, instead of bringing the data towards the nodes where the algorithms and codes are, you should try and bring the code to the partitions and nodes where the data is actually present. This concept is known as data locality. If you try implement data locality in your Spark jobs, then it will try to keep the IO operations within a single physical node, which means that you are avoiding any IO operations over the network. The following image demonstrates what data locality exactly means.
As there are various techniques to reduce Disk IO, there are several techniques to reduce Network IO as well. Most of these techniques directly affect Shuffles and thereby help in reducing the overall Network IO. These techniques are as follows:
- Optimising partitioning techniques: Using proper partitioning techniques will help you avoid any unnecessary Shuffles in your Spark job. In the case of joins, if you use proper partitioning techniques, then the similar data from two tables might lie in the same machine, thereby reducing the number of shuffles required.
- Avoiding wide transformations: Wide transformations are a kind of transformation function that lead to the movement of data across several partitions. For example, if you want to perform a groupByKey() operation, then all the different RDDs having the same key have to travel and come to a single machine. This entire operation requires a lot of travel and come to a single machine. This entire operation requires a lot of shuffling of data. Hence, if wide transformations are strictly required, then you should use alternative operations such as reduceByKey(), which will try to reduce the data at the individual partitions before the actual shuffle happens.
- Using broadcast joins: Joins: are extremely expensive operations, as a large amount of data has to be shuffled across partitions to ensure that the data records with the same keys are in the same machines so that the actual join operation can be performed. In some cases where one table may be comparatively smaller table to all the partitions, thereby reducing shuffling effectively.
In this segment, you learnt about the concept of Network IO and the operation behind it, i.e., Shuffles. Then, you learnt how you can reduce Shuffles and also looked at the concept of data locality. Finally, you learnt briefly about the various techniques that can help in optimising Network IO. In the following segments, you will get a more precise idea of what Shuffles mean and also gain an in-depth understanding of the various techniques to optimise Network IO.
Additional Content
- Data Locality – Link to an article on what data locality means with reference to Spark
Report an error