At the start of this module, we ran a Spark job on the EMR cluster. We had discussed that we would run the job again at the end of the module, after learning about the different types of optimisation techniques. Now that we have gone through the entire module, we can start applying these optimisations to the Spark job.
In the next video, our SME will provide an overview of the various optimisation techniques used to optimise the Spark job.
As discussed in the video above, we applied various optimisation techniques to reduce the overall execution time of the Spark job.
- We cached the DataFrames that are reused throughout the Spark job, such as movie_rating_count, to reduce the overall Disk IO of the program.
- We repartitioned both the major tables in our Spark job on the movie_id column, so that all the rows with the same key lie in the same partition and the join involves less shuffling of data.
- We also implemented a broadcast join to join the movie_ratings table with the movie_name table; since movie_name is a comparatively small table, broadcasting it to every executor further reduces the Network IO.
- We stored the final result in the ORC format instead of the typical JSON format, which reduces the Disk IO drastically. (A combined sketch of these optimisations is given after this list.)
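To make these steps concrete, here is a minimal PySpark sketch of how such a pipeline could be put together. The DataFrame names movie_ratings, movie_name and movie_rating_count come from the discussion above; the input paths, the rating column and the output path are illustrative assumptions, and the actual notebook may structure the job differently.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("MovieRatingsOptimised").getOrCreate()

# Hypothetical input locations; the actual notebook reads its own data sources.
movie_ratings = spark.read.csv("s3://<bucket>/movie_ratings/", header=True, inferSchema=True)
movie_name = spark.read.csv("s3://<bucket>/movie_name/", header=True, inferSchema=True)

# 1. Repartition the large table on the join key, so that all rows with the
#    same movie_id end up in the same partition and the later join shuffles less data.
movie_ratings = movie_ratings.repartition("movie_id")

# 2. Build the reused aggregate and cache it, so later actions read it from
#    memory instead of recomputing it and re-reading the input from disk.
movie_rating_count = (
    movie_ratings.groupBy("movie_id")
    .agg(F.count("*").alias("rating_count"), F.avg("rating").alias("avg_rating"))
    .cache()
)

# 3. Broadcast the small movie_name table so the join does not shuffle the
#    large side over the network.
result = movie_rating_count.join(F.broadcast(movie_name), on="movie_id")

# 4. Write the final result in ORC format instead of JSON to cut down the Disk IO.
result.write.mode("overwrite").orc("s3://<bucket>/output/movie_ratings_orc/")
```

Note that broadcast() is only a hint to the optimiser; Spark can also broadcast small tables automatically when their size is below the spark.sql.autoBroadcastJoinThreshold setting.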
Now, let’s watch the next video and see how the Spark job runs with these optimisations.
As you saw in the video above, after implementing these optimisations in the Spark job, we were able to reduce its execution time significantly.
The table given below shows the execution time at each stage of both the Spark jobs.
Note that this table was created after executing both the Jupyter Notebooks on a fresh EMR cluster. Also, we ran the notebook for the final, optimised job twice: since caching is a lazy operation, the first run also has to populate the cache, so only the second run reflects the benefit of caching.
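Because caching is lazy, no data is actually cached until an action runs. A tiny illustration of this behaviour, using the movie_rating_count DataFrame from the sketch above:

```python
movie_rating_count.cache()   # marks the DataFrame for caching; no work happens yet
movie_rating_count.count()   # first action: computes the result and fills the cache
movie_rating_count.count()   # later actions read from the cache and run much faster
```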
The link to the Jupyter Notebook used in this segment is given below.
Note: You will probably get different results when you run the Jupyter Notebooks. This may be due to variations in network bandwidth and other factors in the cluster environment.
In this segment, we finally ran our initial Spark job with the appropriate optimisations that we learnt about throughout this module.