In the previous segment, you learnt about RDDs and how to create them using different methods. In this segment, you will learn about the various operations on RDDs.
RDDs are Spark’s low-level API; they abstract away the details of how data is partitioned across the nodes of a cluster. Since RDDs store data, Spark provides a range of operations to manipulate and analyse that data. In the next video, let’s hear from Vishwa Mohan as he explains what these operations are.
The first type of operation is called a transformation. You can create new RDDs from existing RDDs by applying transformations. For example, to get an RDD containing only those lines that have the word ‘spark’, you can apply a filter transformation to the original RDD. This will result in a new RDD, where only the lines with the term ‘spark’ are present. However, at the end of the transformation, you will not receive the filtered output in the driver or the storage system.
To generate output from transformations, you need the second type of operation: an action. An action typically involves operations such as counting elements or saving results to disk. Actions trigger the execution of all the transformations that precede them. This is where the concept of lazy evaluation comes into the picture; we will discuss it in the following segments.
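To make the idea of lazy evaluation concrete without needing a Spark cluster, here is a minimal plain-Python sketch (not the Spark API) that mimics the behaviour using a generator: defining the "transformation" does no work, and only the "action" that consumes the result triggers the computation.

```python
# Plain-Python illustration of lazy evaluation (not Spark code).
lines = ["spark is fast", "hadoop mapreduce", "learning spark"]

evaluated = []  # records which lines were actually examined


def keep_spark(line):
    evaluated.append(line)  # side effect lets us observe when work happens
    return "spark" in line


# "Transformation": building the generator runs nothing yet.
filtered = (line for line in lines if keep_spark(line))
assert evaluated == []  # no line has been examined so far

# "Action": consuming the generator triggers the whole pipeline.
result = list(filtered)
assert result == ["spark is fast", "learning spark"]
assert evaluated == lines  # every line was examined only now
```

Spark behaves analogously: `filter()` merely records the computation, and an action such as `count()` or `collect()` causes it to run.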
Some examples of transformations are as follows:
- map(): It runs a function over each element of an RDD.
- flatMap(): It runs a function over each element, where each element may produce zero or more output elements; the results are flattened into a single RDD.
- groupByKey(): It groups the values of a key-value pair RDD by key.
- union(): It combines the elements of two RDDs (duplicates are retained, unlike a mathematical set union).
Some examples of actions are as follows:
- count(): It counts the number of elements in an RDD.
- collect(): It returns all the elements of an RDD to the driver program.
- reduce(): It reduces the elements through an aggregation function. The most common type of reduce function is a sum.
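The actions above can likewise be sketched with ordinary Python (not the Spark API): counting corresponds to `len()`, collecting to materialising a list, and `reduce()` to `functools.reduce` with an aggregation function such as addition.

```python
# Plain-Python sketches of RDD action semantics (not Spark code).
from functools import reduce

data = [1, 2, 3, 4]

# count(): number of elements in the dataset.
assert len(data) == 4

# collect(): all elements brought back to the driver as a local list.
collected = list(data)
assert collected == [1, 2, 3, 4]

# reduce(): aggregate the elements with a function; the most common
# example is a sum.
total = reduce(lambda a, b: a + b, data)
assert total == 10
```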
In this segment, you got an overview of the various operations on Spark RDDs. In the next segment, you will learn about transformation operations in more detail.
Additional Reading:
RDD Operations: Link to RDD operations in the official Spark RDD programming documentation.