Welcome to the session on ‘Optimising Disk IO for Spark’.
In the next video, you will get an overview of the topics that will be covered in this session.
In this session
You will understand some of the basic concepts of Disk IO and learn about the various optimisations that can be applied to reduce it in Spark jobs.
You will first learn how to spin up a Spark cluster on Amazon EMR and understand its key components. You will then run a Spark job on the EMR cluster using a Jupyter Notebook, which you will optimise at the end of this module.
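As a rough illustration of spinning up such a cluster, the sketch below creates a small EMR cluster with Spark and JupyterHub using boto3. The cluster name, region, release label, instance types and IAM roles are placeholder values for illustration, not the exact settings used in the session.

import boto3

# Placeholder region, names, versions and roles; adjust for your own account.
emr = boto3.client("emr", region_name="us-east-1")
response = emr.run_job_flow(
    Name="spark-optimisation-demo",
    ReleaseLabel="emr-6.10.0",
    Applications=[{"Name": "Spark"}, {"Name": "JupyterHub"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,  # keep the cluster alive for notebook use
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Cluster id:", response["JobFlowId"])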
Next, you will learn about the anatomy of a Spark job and see how the components of a job can be analysed in the Spark UI. You will understand the need for optimising Spark jobs in the industry and also learn about the various approaches to optimising them.
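To see this anatomy in practice, the minimal sketch below (a hypothetical example, not the session's notebook) runs two actions. Each action appears as a separate job in the Spark UI (served on port 4040 of the driver by default), where its stages and tasks can be inspected.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("job-anatomy-demo").getOrCreate()

df = spark.range(0, 10_000_000)           # transformation only: no job is triggered yet
row_count = df.count()                    # action: appears as one job in the Spark UI
evens = df.filter("id % 2 = 0").count()   # action: a second job, with its own stages and tasks
print(row_count, evens)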
You will learn about the first approach: reducing the Disk IO. You will also learn about the various file formats and see how choosing columnar file formats, such as Parquet and ORC, can help in optimising job performance.
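As a brief sketch of the idea, the snippet below converts a CSV dataset to Parquet and then reads back a single column. The S3 paths and column name are hypothetical; the point is that a columnar format lets Spark read only the columns a query needs, cutting Disk IO.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical input path and schema, for illustration only.
df = spark.read.csv("s3://my-bucket/raw/events.csv", header=True, inferSchema=True)

# Parquet stores data column by column with compression, so queries that
# touch only a few columns read far less from disk than with row-based CSV.
df.write.mode("overwrite").parquet("s3://my-bucket/curated/events.parquet")

parquet_df = spark.read.parquet("s3://my-bucket/curated/events.parquet")

# Column pruning: only the 'user_id' column is read from disk here.
parquet_df.select("user_id").distinct().count()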
In this session, you will also understand the concepts of serialization and deserialization and learn how this process can be optimised. Finally, you will learn about some Spark memory management methods, such as persist(), cache() and unpersist(), and see how they can be used to reduce the Disk IO in Spark jobs.
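The sketch below illustrates both ideas under assumed names and data: it enables the Kryo serializer via configuration and then caches a DataFrame so that repeated actions avoid recomputation and extra Disk IO.

from pyspark.sql import SparkSession
from pyspark import StorageLevel

# Kryo serialization is typically faster and more compact than the default
# Java serialization (its benefit is mainly in RDD operations and shuffles).
spark = (
    SparkSession.builder
    .appName("caching-demo")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

df = spark.range(0, 5_000_000)

# persist() keeps the computed result in memory, spilling to disk only when
# memory runs out; cache() is shorthand for this same storage level.
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()                       # first action materialises the cache
df.filter("id % 2 = 0").count()  # reuses the cached data instead of recomputing

df.unpersist()  # release the cached blocks once the data is no longer needed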
People you will hear from in this session
Subject Matter Expert
Ajay Shukla
Senior Software Engineer, LinkedIn
Ajay is currently working as a senior software engineer at LinkedIn, an online employment-oriented platform. He has over nine years of experience in the IT industry and has worked in various companies, including Amazon, Walmart, Oracle and others. He has deep knowledge of various tools and technologies that are used today.