In the previous segments, you learnt how to tune the memory and CPU parameters of the various components of a Spark cluster using the spark-submit command.
In this segment, you will learn how to balance the cost-performance trade-off by running a Spark cluster with the optimal configuration.
For this, you will be running the Spark job in the reviews.py code file, which you ran in the previous segment. Let’s watch the next video to learn more about this.
Now that you know what your Spark job looks like, let’s run this job on the Spark cluster using spark-submit. In the next video, our SME will run the Spark job on a 3-node cluster.
As you saw in the video above, we were able to run the job on the 3-node cluster using spark-submit. Here, we used the default spark-submit command and did not change parameters such as the executor memory and executor cores, so that we could analyse the performance of the different cluster machines in their default configuration. You can tweak these parameters to further optimise the job if you wish to do so.
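For reference, the invocation would look roughly like the sketch below, run from the cluster's master node. The file name reviews.py comes from this segment, but the resource values shown are only illustrative examples of the parameters you could override, not the exact values used in the video; the master and deploy mode depend on your own EMR/YARN setup.

```bash
# Minimal run with the default configuration, as done in the video:
spark-submit reviews.py

# The same job with resource parameters set explicitly.
# The values below are illustrative, not the demonstrated ones:
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-cores 2 \
  --executor-memory 2g \
  reviews.py
```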
In the next video, our SME will walk you through the execution times and the total machine cost per hour for each of the three configurations that we chose for our Spark cluster. This will give you an idea of how to optimise the cost-performance trade-off for your own Spark jobs.
In the video above, we analysed how choosing a specific cluster configuration for the Spark job affects the execution time. You also got an idea of how much each configuration would cost per hour.
First, we ran the job on a cluster of three m4.large machines, of which one was the driver node and the other two were the worker nodes. Each machine costs around $0.10 per hour, so with three machines running, the total cost would be around $0.30 per hour.
Note that there may be some additional costs involved in the creation of the cluster. For the purpose of this analysis, we are strictly focussing on the cost of the running machines for each hour of their use. We saw that the job took around 1.8 minutes to execute.
Next, we used a cluster with a total of five machines, instead of three, with the same m4.large configuration. Here, we saw that the job was completed in only about 1.3 minutes.
Finally, for the last cluster, we used a higher configuration machine, m3.xlarge, which has twice the number of virtual CPUs as the m4.large configuration. Each of these machines costs around $0.27 per hour, so the total cost would be approximately $0.80 per hour for the entire cluster. With this cluster configuration, the execution time for the job was about 1.5 minutes.
As you must have observed, even though the m3.xlarge cluster had twice the vCPUs per machine as the m4.large cluster, it was slower and about 60% more expensive than the 5-machine m4.large cluster. Therefore, we can conclude that horizontal scaling yielded better results in terms of both performance and cost.
The table given below shows the different cluster configurations along with their total hourly cost and the execution time for our Spark job.

| Cluster configuration | Total cost per hour (approx.) | Execution time (approx.) |
| --- | --- | --- |
| 3 × m4.large | $0.30 | 1.8 minutes |
| 5 × m4.large | $0.50 | 1.3 minutes |
| 3 × m3.xlarge | $0.80 | 1.5 minutes |
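To see why the five-machine m4.large cluster comes out ahead, you can also compare each configuration's cost per run, i.e. the hourly cost multiplied by the fraction of an hour the job actually used. The short Python sketch below uses the approximate prices and timings quoted in this segment; treat them as rough estimates, not official pricing, and remember that real EMR billing includes the additional charges mentioned above.

```python
# Approximate figures quoted in this segment:
# hourly cluster cost in USD and execution time in minutes.
configs = {
    "3 x m4.large":  {"cost_per_hour": 0.30, "minutes": 1.8},
    "5 x m4.large":  {"cost_per_hour": 0.50, "minutes": 1.3},
    "3 x m3.xlarge": {"cost_per_hour": 0.80, "minutes": 1.5},
}

for name, c in configs.items():
    # Cost of a single run = hourly rate * fraction of an hour used.
    cost_per_run = c["cost_per_hour"] * c["minutes"] / 60
    print(f"{name}: ~${cost_per_run:.4f} per run, {c['minutes']} min")
```

Running this confirms the conclusion above: the m3.xlarge cluster is both slower and costlier per run than the five-machine m4.large cluster.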
It is important to note that every job may be different, and you should always consider the cost-performance trade-off while trying to find the most efficient configuration for your use case.
In this segment, you learnt how to do a cost-to-performance trade-off analysis by running the same job in three different configurations of EMR clusters.
Additional Content
- Official AWS EC2 pricing: Link to the official AWS EC2 pricing for the different instance types. The EMR cluster costs are also based on this pricing.