Paired RDDs

Paired RDD is a special RDD class that holds the data as (Key, Value) pair. Due to the difference in structure, there are various operations associated with Paired RDDs. Let’s take a look at them in this video.

Note

Please note that in this module, you may sometimes see that the kernel is mentioned as Python 2instead of PySpark. This is because some of these videos are older and the Python 2 kernel had the PySpark libraries installed already. For the current configuration of EMR, you will need to use the PySpark kernel only. The SME might also mention EC₂ instance instead of EMR instance which is supposed to be in our case(At the most basic level, EMR instances make use of EC₂ instances with additional configurations).

Note

The notebook discussed in this session is the same as of the previous session.

As discussed in the video, you can easily create a paired RDD using the parallelize() function as the base RDDs.

Example

PowerShell

rdd = sc.parallelize([('a', 3), ('b', 4), ('c', 2), ('d', 5)])

rdd.collect()

Output

You can also use the following methods to create the paired RDDs:

textFile() method.

Transforming base RDDs.

The limitation with textFile() function is that the structure of the file must be in<key ,value> a format to load the data directly in the form of paired RDDs.

Another method to create paired RDDs is by transforming the base RDDs. Transformations like map() and flatMap() can be used to convert the base RDD into paired RDD.

Example

Python

rdd1 = sc.parallelize([1,2,3,4,5])

rdd2 = rdd1.map(lambda x:(x,1))

rdd2.collect()

Output

reduceByKey() and groupByKey() are two important methods on the paired RDDs. Let’s understand both of them one by one:

There are two important methods, reduceByKey() and groupByKey(), in paired RDDs. Let’s understand both of them one by one.

reduceByKey(): reduceByKey() method is used to perform a particular operation on the elements of RDDs. Let’s look at an example.

Example

Python

rdd1 = sc.parallelize([('rdd',50),('Spark',100),('API',100),('Spark',150),('API',50)])

rdd1.reduceByKey(lambda x, y:x+y).collect()

Output

Here, x and y are any two values, which are to be added, for a particular key. Let’s look at the output of this operation.

[(‘rdd’,₅₀), (‘Spark’,₂₅₀), (‘API’,₁₅₀)]

We can see that reduceByKey() will perform the defined operation on the values of a particular key.

Let’s understand this operation with another example.

Pro Tip

The reduce() method in basic RDDs is an action operation that returns a result to the driver program. However, reduceByKey() is a transformation operation that results in a new paired RDD with a key/value pair where value is now an aggregated result for that particular key.

groupByKey(): groupByKey() performs the same operation as reduceByKey(), but it creates an iterable for the values for a particular key. groupByKey() involves a lot of shuffling as it does not combine the keys present in the same executor. Let’s look at an example to understand this.

In the next video, let’s look at more operations that are available on paired RDDs.

Let us look at the operations discussed in this video.

mapValues(): This function is used for operating on the value part of the key/value pair. Let’s look at an example.

Example

Python

rdd1 = sc.parallelize([('a',5),('b',10),('c',15),('a',15),('c',20)])
rdd2 = rdd1.mapValues(1ambda x:x*2)
rdd2.collect()

Output

PowerShell

[('a',10),('b',20),('c',30),('c',40)]

mapValues() example

flatMapValues(): To understand flatMapValues(), let’s look at an example.

Example

Python

rdd1 = sc.parallelize([('a',5),('b',10),('c',15),('a',15),('c',20)])
rdd2 = rdd1.flatmapValues(1ambda x:(x,5))
rdd2.collect()

Output

PowerShell

[('a', 5),
 ('a', 5),
 ('b', 10)'
 ('b', 5),
 ('c', 15)'
 ('c', 5),
 ('a', 15)'
 ('a', 5)'
 ('c', 20),
 ('c', 5),

flatMapValues() example

keys(): The keys() function creates a new RDD that contains only the keys from the paired RDD.

Example

Python

rdd1 = sc.parallelize([('a',5),('b',10),('c',15),('a',15),('c',20)])
rdd2 = rdd1.keys()
rdd2.collect()

Output

PowerShell

['a', 'b', 'c', 'a', 'c']

keys() example

values(): The values() function creates a new RDD that contains only the values from the paired RDD.

Example

Python

rdd1 = sc.parallelize([('a',5),('b',10),('c',15),('a',15),('c',20)])
rdd2 = rdd1.values()
rdd2.collect()

Output

PowerShell

[5, 10, 15, 15, 20]

values() example

By now, you have understood some basic transformation and action operations on paired RDDs. In the upcoming segments, you will learn about various other operators and solve some problem statements using paired RDDs.

Report an error