Why Optimise a Spark job?

In this module, you will learn about the various techniques to optimise Spark jobs, but first, you need to understand the need for optimising a Spark job.

In the next video, our SME will discuss the various reasons why Spark jobs need to be optimised in the industry. You will also learn about the key metrics that are important for optimisation and the various approaches to job optimisation.

As you learnt in the video above, you can look at the need to optimise Spark jobs from multiple perspectives, which are as follows:

  • Execution time: At the industry level, optimising a Spark job can reduce its execution time and, thereby, increase productivity. Execution time is a key indicator of a Spark job's performance; hence, reducing it can prove to be of great advantage.
  • Resource utilisation: Since we do not have access to infinite resources, we have to ensure that the available resources are utilised to their maximum efficiency.
  • Scalability: Data grows at an exponential rate in the industry, both in terms of volume and velocity. So, it is important to optimise Spark jobs in such a way that they can handle this increasing demand.
  • Maintainability: Typically, in the big data industry, most Spark jobs are not run only once and then set aside; they are scheduled and reused on a periodic basis. Therefore, it is important to optimise Spark jobs for maintainability so that, even if some errors occur, the jobs can be easily reused in the data pipeline.

Some of the key performance metrics that should be considered while optimising Spark jobs are as follows:

  • Stages and tasks: As these are the basic units of work in a Spark job, it is important to closely monitor their performance. A larger number of stages means more shuffling, which adversely affects network IO and, thereby, the overall performance of the Spark job.
  • RDD (memory footprint and usage): By monitoring memory usage and CPU utilisation, you can prevent an OutOfMemoryError in your Spark job.
  • Spark environment information: Some of the metrics in the Spark environment, such as the different configurations and the root location of libraries, are useful for analysing the performance of a Spark job.
  • Detailed information about running executors: Executors are the processes on the worker nodes that actually run a Spark job's tasks. So, metrics such as the memory footprint of the executors, CPU utilisation and the number of executors running can help in understanding important factors, including the degree of parallelism of the Spark job (a small inspection sketch follows this list).
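
To make these metrics concrete, here is a minimal PySpark sketch (not part of the module's notebook) showing how to inspect some of them from a running session; the Spark UI exposes the same information interactively.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("metrics-inspection").getOrCreate()
    sc = spark.sparkContext

    # Spark environment information: every configuration key/value pair in effect
    for key, value in sc.getConf().getAll():
        print(key, "=", value)

    # Default degree of parallelism offered by the cluster
    print("Default parallelism:", sc.defaultParallelism)

    # Web UI URL, where stages, tasks, storage (RDD memory) and executors are listed
    print("Spark UI:", sc.uiWebUrl)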

Now, when it comes to actual job optimisation, two main categories of optimisation are usually performed, which are as follows:

  • Code-level optimisation: This includes techniques such as deciding how many partitions to create and what their size should be, which APIs to use for handling data and which methods to use (a small example follows this list).
  • Cluster-level optimisation: This includes techniques such as deciding the optimal number of machines, improving the utilisation of a cluster and preventing underutilisation.
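
As an illustration of a code-level choice, the short PySpark sketch below controls the number of partitions after reading a file; the file path and partition counts are placeholder values, not taken from the module's notebook. Cluster-level parameters are illustrated later in this segment.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("code-level-optimisation").getOrCreate()

    # Reading with the DataFrame API lets Spark's Catalyst optimiser plan the job
    df = spark.read.csv("/data/transactions.csv", header=True, inferSchema=True)
    print("Partitions after read:", df.rdd.getNumPartitions())

    # Repartition to spread work evenly across the cluster (this triggers a shuffle) ...
    df = df.repartition(8)

    # ... or coalesce to merge small partitions without a full shuffle
    df_small = df.coalesce(2)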

In the next video, you will learn more about the two levels of optimisation discussed above, along with the techniques and approaches to be followed under each of these categories.

In the video above, you learnt about both levels of optimisation in Spark.

In code-level optimisation, you can improve a job's performance in the following two ways:

  • Reducing Disk IO: Disk IO is one of the slowest operations in a Spark job and one of the biggest challenges when optimising its performance. Avoiding unnecessary disk IO will significantly improve the performance of a Spark job.
  • Reducing Network IO: Network IO is another slow process, and it occurs whenever shuffling is carried out, because data gets transferred between different nodes. Reducing network IO will significantly improve the performance of your Spark job (a short sketch of both ideas follows this list).
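
The hedged PySpark sketch below illustrates both ideas: caching a reused DataFrame so that it is not read from disk again, and broadcasting a small lookup table so that the large table is not shuffled across the network. The paths and the join column are assumptions made only for this example.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.appName("io-optimisation").getOrCreate()

    orders = spark.read.parquet("/data/orders")        # large table
    countries = spark.read.parquet("/data/countries")  # small lookup table

    # Reducing disk IO: keep a reused DataFrame in memory instead of
    # re-reading it from disk for every action
    orders.cache()
    orders.count()  # first action materialises the cache

    # Reducing network IO: broadcast the small table so that the large
    # table is never shuffled for the join
    joined = orders.join(broadcast(countries), on="country_code", how="left")
    joined.show(5)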

You will learn about many techniques for implementing the aforementioned code-level optimisations.

In cluster-level optimisation, you can optimise parameters such as executor memory and cores, which directly improve the performance of the Spark job. You need to tune these parameters so that they are neither too low, which may cause OutOfMemoryErrors, nor too high, which will lead to wastage and underutilisation of resources.
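
For instance, a minimal sketch of setting such parameters when the session is created might look like the following; the values shown are placeholders, and the right numbers depend entirely on the machines available.

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("cluster-level-optimisation")
        # Memory per executor: too low risks an OutOfMemoryError, too high wastes RAM
        .config("spark.executor.memory", "4g")
        # Cores per executor: how many tasks each executor can run in parallel
        .config("spark.executor.cores", "2")
        # Number of executors (relevant on a cluster manager such as YARN)
        .config("spark.executor.instances", "3")
        .getOrCreate()
    )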

In the next video, our SME will run a Spark job on the EC2 instance that was used in the previous module and show how the OOM issue appears there.

The OOM issue appeared in the EC2 instance because the machine did not have enough resources. Such issues are common while tuning Spark cluster parameters. You will learn more about this in the third session.
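
As a purely illustrative note (and not necessarily the exact scenario shown in the video), one common pattern that triggers an OOM on a machine with limited memory is collecting a large result back to the driver; the sketch below, with an assumed input path, shows the risky call and two safer alternatives.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("oom-illustration").getOrCreate()
    df = spark.read.parquet("/data/orders")  # illustrative path

    # Risky on a machine with little memory: collect() ships every row to the driver
    # rows = df.collect()

    # Safer: inspect a sample, or write the result back to storage instead
    df.show(20)
    df.write.mode("overwrite").parquet("/data/orders_out")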

The link to the Jupyter Notebook used in this segment is given below.

Note:

Please note that you may get different results when you run these Jupyter Notebooks. This may be due to changes in network bandwidth and other internal factors.

In this segment, you understood the need for optimising Spark jobs, especially in an industry environment. You also learnt about some of the key metrics that are considered while optimising Spark jobs. Lastly, you learnt about the various job optimisation approaches and techniques, which you will explore further in the subsequent segments.

Additional Content:

  • An Article on Spark Optimization Techniques: Link to an article on the upGrad blog discussing some common Spark optimisation techniques
  • Things to check out when you face OOM issues: Link to a Stack Overflow thread discussing some of the common solutions that you can implement when you face an OOM error in Spark
