
Introduction to Spark RDDs

Resilient Distributed Datasets (RDDs) form the core abstraction of Spark. In this segment, you will understand the structure of an RDD. RDDs are special data types that are tailor-made for Apache Spark, and much of Spark's early performance advantage came from the design of the RDD structure. An RDD can be considered a distributed set of elements. Let’s hear about it from our expert Vishwa in the next video.

To summarise what you learnt in this video, let’s first take an example to understand the storage of RDDs. 

Consider the following set of elements present in an RDD named ‘abc’. 

All these elements form one RDD but are not stored in one place, i.e., these elements are distributed across the executors in different worker nodes. See the image below, which shows how the elements are spread across different nodes.

To summarise, RDDs are data structures that are designed to process big data effectively. Now, let’s take a look at the following image to understand the properties of RDDs.

The main properties of RDDs are as follows:

  • Distributed collection of data: RDDs exist in a distributed form over different worker nodes. This property allows RDDs to store large data sets. The driver node is responsible for creating and tracking this distribution.
  • Fault tolerance: This refers to the ability to regenerate RDDs if they are lost during computation. Intuitively, fault tolerance implies that if an RDD gets corrupted (lost due to the volatility of memory), Spark can recompute it from the sequence of transformations that produced it, so you recover an uncorrupted copy. You will learn more about this in the next session.
  • Parallel operations: Although RDDs exist as distributed partitions across worker nodes, their processing takes place in parallel. Multiple worker nodes work simultaneously to execute the complete job.
  • Ability to use varied data sources: RDDs are not dependent on any specific structure of an input data source. They can be built from a variety of sources, such as in-memory collections, files, and other RDDs.

These are basic RDDs. Another type of RDD, called a paired RDD, is used extensively. All the data items in a paired RDD are key-value pairs. The keys in a paired RDD are not necessarily unique. A paired RDD is created by transforming a basic RDD using transformation operations such as ‘map’. You will learn more about these operations in the following sessions. Now, let’s take a look at the following image, which demonstrates an example of a paired RDD.

Additional Readings:

  • Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing: The original paper by Matei Zaharia et al. from the AMPLab at UC Berkeley that describes RDDs. This paper proposed RDDs for the first time and implemented them in Spark in the Scala programming language.
  • What are APIs? explained in plain English: This is an article that explains APIs in an easy-to-follow language.
