
Spark Jobs: Can We Do Better?

Now that you have learnt how to set up an EMR cluster and launch a Jupyter Notebook, you can start creating Spark jobs on the cluster.

Let’s start this module by first going through a Spark job. In this Spark job, we will use a MovieLens data set and calculate the number of ratings for each movie. The output will be a table containing the count of ratings for each movie along with the movie name.

We also have the metadata of the data set, which maps each movie ID to its respective movie name. In the next video, our SME will walk you through the PySpark code for solving this analytical query.
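As a rough sketch of what such a query can look like in PySpark, the snippet below reads the ratings and the metadata, counts the ratings per movie ID and joins the counts with the metadata to attach the movie names. Note that the S3 paths are placeholders, and the column names assume the standard MovieLens schema (ratings.csv and movies.csv); the notebook shown in the video may differ in these details.

```python
from pyspark.sql import SparkSession

# Placeholder paths; substitute the locations used on your EMR cluster.
RATINGS_PATH = "s3://your-bucket/movielens/ratings.csv"
MOVIES_PATH = "s3://your-bucket/movielens/movies.csv"

spark = SparkSession.builder.appName("MovieRatingCounts").getOrCreate()

# ratings.csv (assumed schema): userId, movieId, rating, timestamp
ratings = spark.read.csv(RATINGS_PATH, header=True, inferSchema=True)

# movies.csv (assumed metadata schema): movieId, title, genres
movies = spark.read.csv(MOVIES_PATH, header=True, inferSchema=True)

# Count the number of ratings received by each movie ID.
rating_counts = ratings.groupBy("movieId").count()

# Join the counts with the metadata to attach each movie's name.
result = (
    rating_counts.join(movies, on="movieId", how="inner")
    .select("title", "count")
    .orderBy("count", ascending=False)
)

result.show(10, truncate=False)
```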

In the video above, you saw the Spark Notebook used for solving this analytical query.

Now, in the subsequent segments of this module, we will go through the various concepts and techniques used for optimising Spark jobs. At the end of this module, we will apply these techniques to this particular Spark job and observe how much we are able to improve its execution time and optimise the utilisation of our EMR cluster.

The link to the Jupyter Notebook used in this segment is given below.

Note: You may get different results when you run these Jupyter Notebooks. This may be due to variations in network bandwidth and other environmental factors.
