
Tuning Spark Memory and CPU Parameters

In the previous segments, you understood the need for optimising cluster utilisation for Spark jobs.

In this segment, you will learn how to tune the Spark memory and CPU parameters. You will learn about the various configuration parameters that you need to carefully set for optimising the performance of your Spark job. In the upcoming video, our SME will discuss the various parameters in Spark and show how to tune them.

In the video above, you learnt about the Spark memory model. For an executor, the memory model sets aside 10% of the memory as overhead memory and divides the remaining 90% into Spark Memory, User Memory and Reserved Memory. The diagram below depicts the different components of the memory model.

Spark Memory is further divided into Execution Memory and Storage Memory. Execution Memory is where tasks, shuffles and other wide operations are executed and where their intermediate data is held. Storage Memory holds cached data and broadcast tables.

User Memory is the memory set aside for user data structures and internal metadata, and it acts as a safeguard against out-of-memory issues.

Reserved Memory is used to run the executor itself.

Execution Memory is the most critical of these, as it is where the actual task execution and operations take place. Storage Memory is also important, as it holds cached data and broadcast tables.
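
In recent Spark versions, the boundary between these regions is exposed through two configuration properties: spark.memory.fraction (the share of usable heap given to Spark Memory, 0.6 by default) and spark.memory.storageFraction (the share of Spark Memory protected for storage, 0.5 by default). As a minimal sketch, assuming a placeholder application file named my_spark_job.py, these can be passed to a job like this:

    # Override the unified memory split for one job (values shown are the defaults)
    # spark.memory.fraction        -> portion of usable heap given to Spark Memory
    # spark.memory.storageFraction -> portion of Spark Memory protected for caching
    spark-submit \
      --conf spark.memory.fraction=0.6 \
      --conf spark.memory.storageFraction=0.5 \
      my_spark_job.py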

In Spark, you can tweak various configuration parameters. Some of the important ones are as follows:

  • spark.executor.memory: This is the size of memory to be used for each executor that runs the task.
  • spark.executor.cores: This indicates the number of virtual cores to be used by each executor.
  • spark.driver.memory: This is the size of memory to be used for the driver.
  • spark.driver.cores: This indicates the number of virtual cores to be used for the driver.
  • spark.executor.instances: This indicates the number of executors and is to be set unless spark.dynamicAllocation.enabled is set to true.
  • spark.default.parallelism: This indicates the default number of partitions in RDDs returned by transformations such as join, reduceByKey and parallelize when no partition number is set by the user.

These parameters can be set in multiple ways. You can pass them in the program itself by supplying the configuration during Spark context initialisation in SparkConf(). You can also set them in the spark-defaults.conf file, which overrides the default values. Another (and the easiest) way is to pass them on the command line when running the job with the spark-submit command, as sketched below.
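
As a minimal sketch (the application file name and values are placeholders, not taken from the video), the same parameters can be supplied either as entries in spark-defaults.conf or as options to spark-submit:

    # Option 1: entries in $SPARK_HOME/conf/spark-defaults.conf
    #   spark.executor.memory      4g
    #   spark.executor.cores       2
    #   spark.executor.instances   4
    #   spark.default.parallelism  100

    # Option 2: the same settings passed directly on the command line
    spark-submit \
      --conf spark.executor.memory=4g \
      --conf spark.executor.cores=2 \
      --conf spark.executor.instances=4 \
      --conf spark.default.parallelism=100 \
      my_spark_job.py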

In the next video, you will learn how to submit Spark jobs to a cluster (for example, an EMR cluster) using the spark-submit command.

Note: In Cloudera CDH, you need to use spark2-submit. All the other keywords and parameters will be the same as those covered in the video above.

As you learnt in the video above, in order to submit a job, you need to use a utility known as spark-submit, which contains various configurable options such as executor memory and executor cores. 

You need to set these parameters in such a way that the other jobs that are running in your Spark cluster are not affected in any way, and at the same time, you get the best performance for the Spark job that you are trying to run.

Note: The link to the spark-submit documentation has been provided in the Additional Content section.

Now let’s take a look at the spark-submit command on the EMR cluster.

You can use the command given below to understand how the spark-submit command works and the different options that it offers.
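
For instance, the help option prints the full list of flags that spark-submit accepts:

    spark-submit --help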

In the next video, you will learn how to use spark-submit to tune the different parameters of your Spark job.

In the video above, you learnt how to use spark-submit to run your job and tune the different parameters according to your requirements.

As discussed previously and also in the video above, you should always consider both the total available resources that you have and the other jobs that are already running on your Spark cluster, while tuning the cluster parameters for your Spark job.

You can use the template provided below to change the different parameters using spark-submit for your Spark job.
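
A representative template looks like the one below; the application file name and the values shown are placeholders, so adjust them to your cluster and job.

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --driver-memory 2g \
      --driver-cores 1 \
      --executor-memory 5g \
      --executor-cores 3 \
      --num-executors 4 \
      my_spark_job.py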

Note: You can also set the master to local to run the job on a single local machine.

When we ran the Spark job for the first time, it showed an error message stating that the executor memory was set higher (10 GB) than the maximum threshold of the cluster (6,144 MB).

When we ran the job again with the reduced executor memory (5 GB), the job started executing, but eventually, it failed. This is because we had selected the number of executor cores as 5, which is more than the maximum number of virtual cores in the cluster, i.e., 4.

When we ran the job for the third time with three executor cores, the job was successfully completed.
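
Putting the narration together, the successful run corresponds to an invocation along these lines (a sketch only; the application file name is a placeholder):

    spark-submit \
      --master yarn \
      --executor-memory 5g \
      --executor-cores 3 \
      my_spark_job.py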

By checking the Spark History Server UI, we learnt that the job took around 2.4 minutes to complete.

You should always try to run your Spark job using the default parameters set in spark-submit first. This is because most of the time, the default parameters are quite well-optimised for running almost all your Spark jobs efficiently. To do this, you can simply remove all the options from your spark-submit command.
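
In other words, the bare invocation (again with a placeholder application file name) is simply:

    spark-submit my_spark_job.py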

After we ran the job with the default parameters, we saw that the job was completed within 1.7 minutes. As you can observe, in most circumstances, the default parameters are satisfactory and sometimes even better optimised than what you can achieve by manually tuning your Spark job.

Note that you can always use the “spark-submit -h” command, which was discussed previously, to check the various parameters that you can tweak. Also, the AWS EMR documentation on spark-submit contains various formulas that you can use to evaluate the various options available in spark-submit if you want to further tweak your job.
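
As a rough illustration of the kind of sizing heuristics such formulas encode (paraphrased from common Spark-on-YARN guidance, so verify against the linked documentation before relying on them), the comments below walk through executor sizing for a hypothetical node with 16 vCores and 64 GB of RAM:

    # Hypothetical node: 16 vCores, 64 GB RAM (illustrative numbers only)
    #   executor cores      : 5 is a commonly used value for good throughput
    #   executors per node  : (16 - 1 core kept for YARN/OS) / 5  ≈ 3
    #   memory per executor : (64 GB - 1 GB kept for OS) / 3      ≈ 21 GB
    #   executor memory     : ~90% of that (the rest covers overhead) ≈ 19g
    spark-submit \
      --executor-cores 5 \
      --executor-memory 19g \
      --num-executors 3 \
      my_spark_job.py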

In this segment, you learnt about the various Spark cluster parameters. You also learnt how to use the spark-submit command to optimise and tweak the various cluster parameters for your Spark job.

Additional Content

  • Official AWS EMR spark-submit documentation: Link to the official AWS EMR Spark Submit documentation page
