In the previous sessions of this module, you learnt how to optimise Spark at the code level by reducing Disk IO and Network IO. In this session, you will learn how to optimise Spark jobs at the cluster level.
Before learning how to optimise cluster utilisation for Spark, you need to first understand why cluster utilisation is important.
In the upcoming video, our SME Vishwa will discuss the importance of cluster utilisation for optimising Spark jobs. You will also learn about some of the common mistakes that lead to the underutilisation or wastage of Spark clusters, along with an overview of the various techniques that can be used to optimise cluster utilisation.
In the video above, you understood the importance of optimising cluster utilisation for your Spark jobs.
In Spark clusters, you need to adjust the Driver and Executor parameters while trying to optimise cluster utilisation. Both of these components have their own memory. In addition, each executor has its own set of cores on which the tasks of a Spark job run; the total number of executor cores directly determines the degree of parallelism of your Spark job. Therefore, you need to ensure that these parameters are aligned with the requirements of your workload.
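These parameters are usually set as configuration properties when a job is submitted. The sketch below collects the standard Spark property names for Driver memory, Executor memory, Executor cores and Executor count in a plain dictionary, and computes the resulting degree of parallelism. The property names are real Spark settings, but the values are illustrative assumptions for a mid-sized job, not recommendations:

```python
# Illustrative cluster configuration. The keys are standard Spark
# configuration properties; the values are assumptions for this example.
spark_conf = {
    "spark.driver.memory": "4g",       # memory for the Driver process
    "spark.executor.memory": "8g",     # memory per Executor
    "spark.executor.cores": "4",       # concurrent tasks per Executor
    "spark.executor.instances": "10",  # number of Executors requested
}

def degree_of_parallelism(conf):
    """Maximum number of tasks that can run at once under this config."""
    return int(conf["spark.executor.instances"]) * int(conf["spark.executor.cores"])

print(degree_of_parallelism(spark_conf))  # -> 40
```

With this configuration, the cluster can run at most 40 tasks in parallel, so a stage with fewer than 40 partitions would leave some executor cores idle.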
So far, most of your optimisations were proactive in nature, which means that you were tuning your Spark job before any problem occurred. Sometimes, especially in the case of cluster optimisation, your optimisations will need to be reactive. If your Spark job fails with an out-of-memory (OOM) error in the Driver logs, it usually means that the final result of the job, which is ultimately collected on the Driver node, is larger than the Driver memory; to avoid this, you will need to increase the Driver memory. Similarly, if you encounter an OOM error in your Executor logs, you may need to increase the Executor memory. You might also observe that the job is running quite slowly; in that case, you might need to increase the number of executor cores in order to raise the degree of parallelism and, thereby, the performance of your Spark cluster.
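A simple reactive pattern is to retry a failed job with more memory. The helper below is a hypothetical sketch (it is not part of any Spark API) that scales a memory setting written in Spark's usual "4g"/"512m" notation:

```python
def bump_memory(mem: str, factor: int = 2) -> str:
    """Scale a Spark memory string such as '4g' or '512m' by `factor`.

    Hypothetical helper for reactive retries after an OOM failure;
    it only handles single-letter unit suffixes.
    """
    value, unit = int(mem[:-1]), mem[-1]
    return f"{value * factor}{unit}"

# After a Driver OOM, retry with double the Driver memory:
print(bump_memory("4g"))   # -> 8g
# After an Executor OOM, do the same for the Executor memory:
print(bump_memory("8g"))   # -> 16g
```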
In the opposite scenario, you may set your parameters higher than what is required. If your Driver memory is too high, resources are wasted: even though you might not face any out-of-memory issues yourself, other jobs could suffer because there are not enough resources left for them. The same applies to an oversized Executor memory, which again leads to wastage of resources and potential issues in other jobs.
The number of partitions in your Spark job should also be greater than the total number of executor cores in your cluster. This ensures that your Spark executor nodes are not underutilised in any scenario. However, the number of partitions should not be excessively high either; otherwise, it puts stress on the Driver node, as the node has to maintain the metadata of every partition. Conversely, if the number of executor cores exceeds the number of partitions, the surplus cores sit idle, which also wastes resources.
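A widely quoted rule of thumb, which is an assumption rather than a Spark default, is to use roughly two to four partitions per available core. A minimal sketch of that sizing rule:

```python
def recommended_partitions(total_executor_cores: int, per_core: int = 3) -> int:
    """Suggest a partition count of `per_core` partitions per core.

    The 2-4x multiplier is a common rule of thumb, not a Spark default.
    Keeping the count moderate avoids overloading the Driver with
    partition metadata.
    """
    return total_executor_cores * per_core

# A cluster with 10 executors of 4 cores each has 40 cores in total:
print(recommended_partitions(40))  # -> 120
```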
Cluster utilisation can be optimised in the following ways:
- The size of your Driver and Executor memory should always be based on the size of the data of the job. This will help in optimising the performance of your jobs while also preventing cluster underutilisation. Since every job can have different requirements, you cannot choose any one particular configuration for all your jobs.
- You should always try to strike the right balance between the performance benefits that you will get from your cluster configuration and the particular costs that you will have to incur for using these services.
- A standard practice that you should follow while choosing the optimal configuration for your Spark cluster is to start with the default configuration for your Spark job and then tweak the CPU and memory parameters accordingly. In most cases, your Spark job will perform well with the default configuration.
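In practice, starting from the defaults means overriding only the parameters that the job actually needs. The sketch below merges Spark's documented defaults for two properties (spark.executor.memory defaults to 1g, and spark.executor.cores defaults to 1 on YARN) with a job-specific override; the override value is an illustrative assumption:

```python
# Spark's documented defaults for these two properties
# (spark.executor.cores defaults to 1 in YARN mode).
defaults = {
    "spark.executor.memory": "1g",
    "spark.executor.cores": "1",
}

# Job-specific tweak, applied only where the default proved insufficient.
# The 8g figure is illustrative, not a recommendation.
overrides = {
    "spark.executor.memory": "8g",
}

# Later entries win, so overrides take precedence over defaults.
effective_conf = {**defaults, **overrides}
print(effective_conf["spark.executor.memory"])  # -> 8g
print(effective_conf["spark.executor.cores"])   # -> 1
```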
In this segment, you learnt why optimising cluster parameters is important while trying to optimise a Spark job. You also learnt some of the common mistakes that lead to underutilisation or wastage of resources and some of the best practices that you should follow to optimise cluster utilisation.
Additional Content
- Best practices for managing memory for Spark applications on EMR: Link to an article on the AWS blog detailing some of the best practices for successfully managing memory for Spark applications on Amazon EMR clusters