Welcome to the session on ‘Paired RDDs’. The previous session covered the basic concepts of Apache Spark RDDs. You have learnt about the different components of Spark and understood the differences between Spark and MapReduce. You were also introduced to the core abstraction of Spark – the RDD.
In this session, you will learn how to write Spark programs using RDDs, the core data structure of Spark. The session is filled with code demos that explain what each operation on RDDs does, and case studies that will help you understand how to use those operations to analyse data.
This session will build on the programming aspect of Spark. You will learn how to process data in Spark paired RDDs using the PySpark API. This session covers:
- Creating paired RDDs.
- Operations on paired RDDs.
- Solving problem statements.
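As a preview of the operations listed above, the semantics of one common paired-RDD operation, reduceByKey, can be sketched in plain Python. This is a conceptual stand-in only, not the PySpark API itself; in PySpark you would create the pairs with `rdd.map(lambda w: (w, 1))` and aggregate with `rdd.reduceByKey(lambda a, b: a + b)`. The word list and the helper name `reduce_by_key` are illustrative.

```python
# A paired RDD is simply an RDD whose elements are (key, value) tuples.
# Below, the behaviour of map + reduceByKey is emulated in plain Python
# (no Spark installation needed) to show what the real calls compute.

words = ["spark", "rdd", "spark", "pyspark", "rdd", "spark"]

# PySpark equivalent: rdd.map(lambda w: (w, 1))
pairs = [(w, 1) for w in words]

# PySpark equivalent: pair_rdd.reduceByKey(lambda a, b: a + b)
def reduce_by_key(pairs, f):
    acc = {}
    for k, v in pairs:
        # Combine the incoming value with the running value for this key
        acc[k] = f(acc[k], v) if k in acc else v
    return sorted(acc.items())

print(reduce_by_key(pairs, lambda a, b: a + b))
# → [('pyspark', 1), ('rdd', 2), ('spark', 3)]
```

In real PySpark the reduction runs in parallel across partitions, which is why the combining function must be associative and commutative; the sequential loop above only mirrors the final result.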
The objective of this session is to give you hands-on experience of working with Spark’s computation engine. While running the code, if you encounter errors, try debugging them on your own, as doing so will give you valuable experience in handling Spark jobs.
The code shown in the videos must be typed out and run by you. Explanations are provided in the text accompanying the videos, along with additional exercises. Make sure that you complete these for a better understanding.