Welcome to the module on ‘Optimising Spark for Large-Scale Data Processing’.
In the previous module, you learnt about Apache Spark and its ecosystem. You also explored core Spark concepts, such as RDDs and DataFrames, and learnt how to use them to write Spark programs. Spark is a much faster big data processing framework than MapReduce because it focuses on in-memory computation. However, there are still many optimisations you can apply while writing Spark programs, both to improve job performance and to reduce the memory that your jobs use.
In the next video, our SME Vishwa will provide you with a brief introduction to the topics that will be covered in this module.
In this module, you will first write a Spark job; at the end of the module, you will revisit it and apply all the optimisation techniques you have learnt.
Session 01
- In the first session, you will learn how to spin up a Spark cluster on Amazon EMR and understand its various concepts.
- You will then run a Spark job in a Jupyter Notebook on the EMR cluster, which you will optimise at the end of the module using the concepts covered here.
- Next, you will learn about the anatomy of a Spark job, understand why optimising Spark jobs is important in the industry, and explore the various approaches for achieving this. You will first learn how to optimise disk IO: you will cover important concepts and techniques such as optimising file formats, serialisation and deserialisation, and Spark memory management parameters, including persist, cache and unpersist, and see how they can be used to reduce disk IO in Spark jobs (a short sketch of these ideas follows this overview).
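As a preview of these disk-IO ideas, here is a minimal PySpark sketch, assuming hypothetical S3 paths and a hypothetical "status" column; treat it as an illustration rather than the exact code used in the session.

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("disk-io-preview").getOrCreate()

# Columnar formats such as Parquet let Spark read only the columns it needs,
# so converting raw CSV input once often reduces disk IO for every later job.
raw = spark.read.csv("s3://example-bucket/raw/events.csv",  # hypothetical path
                     header=True, inferSchema=True)
raw.write.mode("overwrite").parquet("s3://example-bucket/curated/events")

events = spark.read.parquet("s3://example-bucket/curated/events")

# persist/cache keep a reused DataFrame in memory (spilling to disk if needed),
# so Spark does not recompute it from disk for every action.
events.persist(StorageLevel.MEMORY_AND_DISK)
events.filter(events["status"] == "FAILED").count()  # first action materialises the cache
events.groupBy("status").count().show()              # second action reuses it

# unpersist releases the cached blocks once they are no longer needed.
events.unpersist()
```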
Session 02
- In the second session, you will learn how to optimise network IO.
- You will first understand the concept of shuffles and learn how to reduce them. You will then learn about various techniques for reducing network IO, such as using reduceByKey instead of groupByKey, optimising joins, optimising partitioning and using custom partitioners (a brief sketch follows this overview).
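To give a flavour of the reduceByKey-versus-groupByKey idea, the sketch below contrasts the two on a made-up word-count RDD; it is an illustration, not the session's exact example.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shuffle-preview").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize(["spark", "emr", "spark", "shuffle", "emr", "spark"]) \
          .map(lambda w: (w, 1))

# groupByKey ships every (key, 1) pair across the network before counting.
grouped = pairs.groupByKey().mapValues(len)

# reduceByKey combines values within each partition first (map-side combine),
# so far fewer records cross the network during the shuffle.
reduced = pairs.reduceByKey(lambda a, b: a + b)

print(sorted(grouped.collect()))  # same counts, but a heavier shuffle
print(sorted(reduced.collect()))  # [('emr', 2), ('shuffle', 1), ('spark', 3)]
```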
Session 03
- In the third session, you will learn how to optimise the Spark cluster configuration. You will first understand why this is important and then learn how to avoid under-utilisation of the cluster.
- Next, you will learn about the various job deployment modes in Spark, the parameters that you can tune for a Spark cluster, and how to maintain an appropriate cost-performance trade-off (see the configuration sketch after this list).
- Finally, you will learn about some of the best practices that the industry follows when deploying and working with Spark.
- At the end of this module, you will optimise the Spark job written earlier in the module by applying the optimisation concepts covered throughout.
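As a preview of the kind of tuning covered in this session, the sketch below sets a few common resource parameters when building a SparkSession; the numbers are illustrative placeholders, not recommendations, and the right values depend on your EMR node types and workload.

```python
from pyspark.sql import SparkSession

# Illustrative values only -- appropriate numbers depend on the cluster and job.
# Deployment mode (client vs cluster) is chosen when the job is submitted,
# e.g. spark-submit --deploy-mode cluster.
spark = (
    SparkSession.builder
    .appName("cluster-config-preview")
    .config("spark.executor.instances", "10")       # executors requested for the job
    .config("spark.executor.cores", "4")            # CPU cores per executor
    .config("spark.executor.memory", "8g")          # heap memory per executor
    .config("spark.sql.shuffle.partitions", "200")  # partitions produced by wide transformations
    .getOrCreate()
)
```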
Guidelines for this module
- As this module contains a lot of coding-based content, it is advised that you keep practising the various commands used throughout this module and actively attempt the in-segment questions as well.
- Also, use the EMR cluster carefully while trying out the different optimisation techniques: EMR is a costly service, and you might overshoot your budget if the cluster is left running for a long time.
- Please do read the lab documents for all the coding sessions and refer to them whenever you have any doubt regarding the PySpark commands used throughout this module.
- The theory videos in this module are presentation-based, and the presentation used in each session is provided in the corresponding Session Summary segment.
- The lecture notes for this module are provided at the end, in the last segment of the session on Optimising Spark Clusters.
Guidelines for in-segment and graded questions
There will be a separate session for graded questions; the other sessions contain non-graded questions. Each graded question in this module carries 10 marks for a correct answer and 0 marks for an incorrect one, and allows only one attempt, whereas non-graded questions may allow one or two attempts depending on the question type and the number of options.
People you will hear from in this module
Subject Matter Expert
Senior Software Engineer, LinkedIn
Vishwa is currently working as a senior software engineer at LinkedIn, an online employment-oriented platform. He has over nine years of experience in the IT industry and has worked at various companies, including Amazon, Walmart and Oracle. He has deep knowledge of a wide range of tools and technologies used in the industry today.