
Best Practices While Working with Apache Spark

In the previous segment, you learnt how to run Spark jobs in a production environment, along with some of the factors that you should consider while running them.

In this segment, you will learn about some of the best practices that you should follow while working with Apache Spark. In the upcoming video, our SME will discuss some of these best practices in detail.

Typically, no single configuration can satisfy the requirements of every type of Spark job. Different workloads have different processing patterns, so studying and analysing these patterns helps you achieve the best results in terms of performance and efficiency.

You can always start with the default configuration, monitor how the job behaves and then tweak the parameters accordingly to achieve better performance. Avoid making drastic changes initially; instead, keep making small, incremental changes to improve the performance of your Spark job.
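
As a rough sketch of this incremental approach, the snippet below (in Scala, with a hypothetical application name) starts from the defaults and overrides only one setting, the serialiser, before the next run is measured again:

    import org.apache.spark.sql.SparkSession

    // One small, incremental change on top of the defaults: switch the serialiser to Kryo.
    // Re-run the job, compare the Spark UI metrics with the previous run, and only then
    // decide on the next change.
    val spark = SparkSession.builder()
      .appName("incremental-tuning-example")   // hypothetical application name
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .getOrCreate()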

You should also consider the processing power of your Spark cluster and try to fully utilise its parallel processing capabilities; this way, you avoid underutilising your machines.
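
For illustration, the sketch below reads the parallelism available to the application and repartitions a hypothetical events DataFrame to a small multiple of that number, so that every core has tasks to work on; the input path and the multiplier are assumptions, not fixed rules:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("parallelism-example").getOrCreate()

    // Number of cores available to this application.
    val totalCores = spark.sparkContext.defaultParallelism

    // Hypothetical input; a few tasks per core keeps every core busy.
    val events = spark.read.parquet("/data/events")
    val repartitioned = events.repartition(totalCores * 3)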

You should try to keep the tasks of your Spark job small. Large tasks can produce large shuffle blocks, which may lead to shuffle fetch failures or out-of-memory errors, and they also put more pressure on the garbage collector.
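
One possible way to keep tasks small is to raise the number of shuffle partitions, as in the sketch below; the orders DataFrame, its input path and the partition count of 1,000 are all assumptions used only for illustration:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("small-tasks-example").getOrCreate()

    // More shuffle partitions means each shuffle task (and hence each shuffle block)
    // handles a smaller slice of data. The default is 200; 1000 is illustrative.
    spark.conf.set("spark.sql.shuffle.partitions", "1000")

    // Hypothetical input and grouping column.
    val orders = spark.read.parquet("/data/orders")
    val perCustomer = orders.groupBy("customer_id").count()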

You should also remember that your Spark job is only one of many workloads running in parallel on your cluster (for example, monitoring systems and other jobs). Therefore, ensure that a single job does not consume all the resources of the cluster.
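
As one possible way to leave headroom for other workloads, the sketch below caps the job using dynamic allocation and modest executor sizes; all the values are illustrative, and dynamic allocation also assumes that the cluster has the external shuffle service (or shuffle tracking) enabled:

    import org.apache.spark.sql.SparkSession

    // Cap how much of the cluster this one job can claim.
    val spark = SparkSession.builder()
      .appName("bounded-resources-example")                 // hypothetical application name
      .config("spark.dynamicAllocation.enabled", "true")
      .config("spark.dynamicAllocation.maxExecutors", "20") // leave room for other workloads
      .config("spark.executor.cores", "4")
      .config("spark.executor.memory", "8g")
      .getOrCreate()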

Some common errors that you may encounter while working with Apache Spark, along with their typical solutions, are listed below (a configuration sketch showing where each fix is applied follows the list):

  • Driver OOM (out of memory): Increase the driver memory.
  • Executor OOM: Increase the executor memory.
  • Huge shuffle block: Keep the size of the tasks small.
  • Frequent GC (garbage collection): Keep the size of the tasks small.
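
The sketch below maps these fixes to their configuration keys; all the values are illustrative. Note that spark.driver.memory normally has to be set before the driver JVM starts (for example, via spark-submit --driver-memory or spark-defaults.conf), so it appears here only to show the key name:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("common-fixes-example")                 // hypothetical application name
      .config("spark.driver.memory", "4g")             // Driver OOM: increase the driver memory
      .config("spark.executor.memory", "8g")           // Executor OOM: increase the executor memory
      .config("spark.sql.shuffle.partitions", "1000")  // Huge shuffle block / frequent GC: smaller tasks
      .getOrCreate()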

Note that the official Spark documentation is the most reliable resource that you can refer to whenever you need to recall any concept while optimising your Spark jobs.

In this segment, you learnt about some of the best practices that you should follow while running your Spark jobs.
