
Operations on Paired RDDs

In the previous segment, we looked at paired RDDs and some useful operations on them. In this segment, you will learn a few more operations that are useful for joining two paired RDDs. These operations are join(), leftOuterJoin(), rightOuterJoin() and cogroup(). Let’s hear about them from our SME Vishwa in the upcoming video.

Note

Please note that in this module, you may sometimes see the kernel mentioned as Python 2 instead of PySpark. This is because some of these videos are older, and the Python 2 kernel already had the PySpark libraries installed. For the current configuration of EMR, you will need to use the PySpark kernel only. The SME might also say EC2 instance instead of EMR instance, which is what applies in our case (at the most basic level, EMR clusters run on EC2 instances with additional configurations).

Let’s look at each of the operations discussed in the video above.

sortByKey(): This operation sorts the elements of a paired RDD. The sorting is done based on the key of each pair, in ascending order by default.
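The behaviour of sortByKey() can be sketched in plain Python (in PySpark itself you would simply call rdd.sortByKey(); here we simulate the semantics on an ordinary list of pairs):

```python
# Plain-Python sketch of sortByKey() semantics.
# In PySpark: sorted_rdd = rdd.sortByKey()
pairs = [('Dataframe', 100), ('Spark', 50), ('API', 150)]

# Sort the (key, value) pairs by key, ascending by default
sorted_pairs = sorted(pairs, key=lambda kv: kv[0])
# sorted_pairs is [('API', 150), ('Dataframe', 100), ('Spark', 50)]
```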

join(): This operation performs an inner join: a key appears in the output only if it is present in both paired RDDs. For an outer join (leftOuterJoin() or rightOuterJoin()), a key may be present in only one of the two paired RDDs.

Consider the following example:

Python
rdd1 = [('Spark', 50), ('Dataframe', 100), ('API', 150), ('Dataset', 120)]

rdd2 = [('Spark', 100), ('Dataframe', 120), ('RDD',150)]

Output

Now, if you apply the join operation on these two RDDs, the output is another RDD.

Python
rdd3 = [('Spark', (50,100)), ('Dataframe', (100,120))]
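The inner-join semantics can be reproduced in plain Python to see why only 'Spark' and 'Dataframe' survive (in PySpark you would call rdd1.join(rdd2); this is only a sketch of what that computes):

```python
# Plain-Python sketch of rdd1.join(rdd2) (inner join) semantics.
rdd1 = [('Spark', 50), ('Dataframe', 100), ('API', 150), ('Dataset', 120)]
rdd2 = [('Spark', 100), ('Dataframe', 120), ('RDD', 150)]

d2 = dict(rdd2)
# Keep a key only if it is present in BOTH sides; pair up the values
joined = [(k, (v, d2[k])) for k, v in rdd1 if k in d2]
# joined is [('Spark', (50, 100)), ('Dataframe', (100, 120))]
```

'API' and 'Dataset' are dropped because they appear only in rdd1, and 'RDD' is dropped because it appears only in rdd2.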

In the upcoming videos, let’s look at some other operations.

rightOuterJoin(): rightOuterJoin() may skip keys that appear only in the RDD on the left side of the operation; however, every key present in the RDD on the right side must appear in the output. Consider:

rdd1.rightOuterJoin(rdd2)

Here, rdd1 is the RDD on the left of the operation, and rdd2 is the RDD on the right; every key in rdd2 must be present in the output.

Example

Python
rdd1 = [('Spark', 50), ('Dataframe', 100), ('API', 150), ('Dataset', 120)]

rdd2 = [('Spark', 100), ('Dataframe', 120), ('RDD',150)]

The output is:

Python
rdd3 = [('Spark', (50, 100)), ('Dataframe', (100, 120)), ('RDD', (None, 150))]
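A plain-Python sketch of rdd1.rightOuterJoin(rdd2) shows where the None comes from: 'RDD' exists only in rdd2, so its left-side value is missing:

```python
# Plain-Python sketch of rdd1.rightOuterJoin(rdd2) semantics.
rdd1 = [('Spark', 50), ('Dataframe', 100), ('API', 150), ('Dataset', 120)]
rdd2 = [('Spark', 100), ('Dataframe', 120), ('RDD', 150)]

d1 = dict(rdd1)
# Every key of rdd2 is kept; a missing left-side value becomes None
right_joined = [(k, (d1.get(k), w)) for k, w in rdd2]
# right_joined is
# [('Spark', (50, 100)), ('Dataframe', (100, 120)), ('RDD', (None, 150))]
```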

leftOuterJoin(): leftOuterJoin() may skip keys that appear only in the RDD on the right side of the operation; however, every key present in the RDD on the left side must appear in the output. Consider:

rdd1.leftOuterJoin(rdd2)

Here, rdd1 is the RDD on the left of the operation, and rdd2 is the RDD on the right; every key in rdd1 must be present in the output.

Example

Python
rdd1 = [('Spark', 50), ('Dataframe', 100), ('API', 150), ('Dataset', 120)]

rdd2 = [('Spark', 100), ('Dataframe', 120), ('RDD',150)]

The output will be as follows:

Python
rdd3 = [('Spark', (50, 100)), ('Dataframe', (100, 120)), ('API', (150, None)), ('Dataset', (120, None))]
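The same sketch works for rdd1.leftOuterJoin(rdd2): every key of rdd1 is kept, and keys missing from rdd2 ('API' and 'Dataset') get None on the right:

```python
# Plain-Python sketch of rdd1.leftOuterJoin(rdd2) semantics.
rdd1 = [('Spark', 50), ('Dataframe', 100), ('API', 150), ('Dataset', 120)]
rdd2 = [('Spark', 100), ('Dataframe', 120), ('RDD', 150)]

d2 = dict(rdd2)
# Every key of rdd1 is kept; a missing right-side value becomes None
left_joined = [(k, (v, d2.get(k))) for k, v in rdd1]
# left_joined is
# [('Spark', (50, 100)), ('Dataframe', (100, 120)),
#  ('API', (150, None)), ('Dataset', (120, None))]
```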

There is one more operator, cogroup(). Let’s understand this operator in the upcoming video.

cogroup(): In the case of cogroup(), if a key is present in either of the two RDDs, it will be present in the output, paired with the grouped values from each RDD.
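A plain-Python sketch of rdd1.cogroup(rdd2) semantics (PySpark returns iterables of values for each side; lists are used here for readability):

```python
# Plain-Python sketch of rdd1.cogroup(rdd2) semantics.
rdd1 = [('Spark', 50), ('Dataframe', 100), ('API', 150), ('Dataset', 120)]
rdd2 = [('Spark', 100), ('Dataframe', 120), ('RDD', 150)]

# Every key from either RDD appears once (first-seen order, deduplicated)
keys = list(dict.fromkeys([k for k, _ in rdd1] + [k for k, _ in rdd2]))

# For each key, group ALL its values from each side into a list
cogrouped = [(k,
              ([v for kk, v in rdd1 if kk == k],
               [w for kk, w in rdd2 if kk == k]))
             for k in keys]
# e.g. ('API', ([150], [])) and ('RDD', ([], [150])) are both kept
```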

As you have already understood transformation operations on RDDs, let’s see a few action operations as well.

The operations discussed in this video are:

  • countByKey(): Counts the number of elements for each key.
  • lookup(key): Finds all the values associated with the given key.
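Both actions can be sketched in plain Python (in PySpark: rdd.countByKey() and rdd.lookup(key); here we simulate them on a list of pairs, using a duplicate 'Spark' key to show the grouping):

```python
from collections import Counter

# Plain-Python sketch of countByKey() and lookup(key) semantics.
pairs = [('Spark', 50), ('Dataframe', 100), ('Spark', 70)]

# countByKey(): count how many pairs share each key
counts = Counter(k for k, _ in pairs)
# counts['Spark'] is 2, counts['Dataframe'] is 1

# lookup('Spark'): collect all values stored under the key 'Spark'
spark_values = [v for k, v in pairs if k == 'Spark']
# spark_values is [50, 70]
```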
