In this segment, let’s summarise what you learnt in this session. In the next video, our SME will briefly recap all the topics related to Spark Optimisation and the techniques for reducing Disk IO covered in this session.
In the first segment, you got an overview of the Amazon EMR cluster and its features. You learnt how to deploy the cluster and launch Jupyter Notebook on the EMR cluster.
In the next segment, you wrote a Spark job and ran it on the Spark EMR cluster. You will run this job again at the end of this module after implementing all the optimisation techniques that you learnt throughout this module, and compare the improvements in terms of processing time and resource utilisation.
In the next segment, you learnt about the different components and terminology related to a Spark job. You also saw a demonstration of how to run a job and inspect its different components on the Spark History Server UI.
In the next segment, you understood the need for optimising Spark jobs in the industry and learnt about some of the key performance metrics you should consider while optimising Spark. You also learnt about the two main approaches to optimising Spark jobs: code-level optimisation and cluster-level optimisation. In code-level optimisation, you learnt how to reduce Disk and Network IO to optimise Spark jobs. You also learnt about the common out-of-memory (OOM) issue that occurs when you run a job on an EC2 instance.
In the next segment, you understood the concept of Disk IO and why it is so expensive. You also learnt about the different techniques used for reducing Disk IO.
In the next segment, you learnt about the file formats that you would generally use in a Spark job. You explored the Parquet and ORC file formats and used them in your Spark job. You saw how the choice of file format affects a job's performance in the industry and learnt about the benefits of columnar file formats.
In the next segment, you understood the concepts of serialisation and deserialisation and their importance. You learnt about the Marshal and Pickle serialisers, how to use them with PySpark and how to implement them in an actual Spark job.
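PySpark's MarshalSerializer and PickleSerializer are thin wrappers over Python's built-in marshal and pickle modules, so the core trade-off between them can be sketched with the standard library alone, without a Spark cluster:

```python
import marshal
import pickle

data = [(i, str(i)) for i in range(1000)]  # simple built-in types only

m_bytes = marshal.dumps(data)   # marshal: fast, but supports built-in types only
p_bytes = pickle.dumps(data)    # pickle: slower, but handles almost any object

# Both round-trip simple data without loss.
assert marshal.loads(m_bytes) == data
assert pickle.loads(p_bytes) == data

# pickle can serialise arbitrary objects; marshal cannot.
pickle.dumps(object())
try:
    marshal.dumps(object())
except ValueError:
    print("marshal rejects non-built-in objects")
```

In PySpark, you choose the serialiser when creating the context, e.g. `SparkContext(serializer=MarshalSerializer())` after importing it from `pyspark.serializers`; MarshalSerializer is the faster choice when your records contain only built-in types.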
In the last segment of this session, you learnt about the various storage levels supported by Spark and about techniques, such as Cache and Persist, that are used for keeping intermediate RDDs and data sets in memory. Additionally, you understood the concepts of Checkpointing and Unpersist. Finally, you implemented Persist, Cache and Unpersist in a Spark job and compared the performance of jobs with and without these techniques.
The PPT that was used throughout this session is attached below.