User Pin Case Study

Now that we have discussed various operations on RDDs and solved a few problem statements, let’s look at another one. Here, we have data for five different users. One file contains the five categories that each user is interested in. Another file contains the categories that each user has searched for recently. You can download both of these files here.

There are two ways to put this data in HDFS.

Upload these files from your local file system to the livy directory present in HDFS on the EMR instance.

Download the files directly to the EMR instance with wget and then push them to the livy directory in HDFS.
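The second approach can be sketched as follows. This is only a minimal illustration, written in Python to match the rest of the module: it builds the wget and hdfs dfs -put commands and runs them with subprocess. The download URL, the file name, and the HDFS path of the livy directory are not given in this text, so they are left as parameters, and the values shown in the comments are hypothetical placeholders.

```python
import subprocess


def build_commands(url, filename, hdfs_dir):
    """Return the two commands for the 'wget, then push to HDFS' approach."""
    return [
        # Download the file to the local disk of the EMR instance.
        ["wget", "-O", filename, url],
        # Copy the downloaded file into HDFS (-f overwrites an existing copy).
        ["hdfs", "dfs", "-put", "-f", filename, hdfs_dir],
    ]


def run_commands(commands):
    """Run each command in order and stop at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


# Hypothetical usage on the EMR instance (replace every value with your own):
# run_commands(build_commands("https://example.com/interests.txt",
#                             "interests.txt", "/user/livy"))
```

The same two commands can also be typed directly into a terminal on the EMR instance; wrapping them in Python only keeps the example in one language.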

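The problem statement is to find, for each user, the new categories that the user has searched for recently, that is, the categories that appear in the recent searches but not among the user's existing interests.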

Let’s watch this video, in which Kautak and Vishwa discuss one of the ways in which this problem statement can be solved.

Note

Please note that in this module, you may sometimes see the kernel shown as python2 instead of PySpark. This is because some of these videos are older, and at that time the python2 kernel already had the PySpark libraries installed. For the current configuration of EMR, you need to use the PySpark kernel only. The SME might also mention an EC2 instance instead of an EMR instance; in our case, it is an EMR instance. (At the most basic level, EMR instances make use of EC2 instances with additional configurations.)

As discussed in the video, we create a paired RDD for each file, in which the user is the key and the category is the value. We use the intersection() method to find the categories that the users still search for, and then subtract the result of the intersection() method from the new search results. What remains is only those categories that each user has searched for recently.
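The steps described above can be sketched roughly as follows. This is not the code from the video; it is a minimal illustration under some assumptions: each line of both files is taken to look like "user,category1,category2,...", the file paths are passed in by the caller, and sc is the SparkContext that the PySpark kernel provides. The helper names parse_line and new_categories are made up for this example.

```python
def parse_line(line):
    """Turn 'user1,sports,music' into [('user1', 'sports'), ('user1', 'music')].

    Assumes a comma-separated layout; adjust the split if your files differ.
    """
    user, *categories = [field.strip() for field in line.split(",")]
    return [(user, category) for category in categories if category]


def new_categories(sc, interests_path, searches_path):
    """Return an RDD of (user, [categories]) that are new in the recent searches."""
    # Paired RDDs: the user is the key and the category is the value.
    interests = sc.textFile(interests_path).flatMap(parse_line)
    searches = sc.textFile(searches_path).flatMap(parse_line)

    # Categories that a user was already interested in and still searches for.
    still_searched = interests.intersection(searches)

    # Remove those pairs from the recent searches; only the new ones remain.
    only_new = searches.subtract(still_searched)

    # Collect the new categories of each user into one sorted list.
    return only_new.groupByKey().mapValues(sorted)
```

In a notebook running the PySpark kernel, this could be called as new_categories(sc, "interests.txt", "searches.txt").collect(), where both file names are hypothetical. Subtracting the interests RDD from the searches RDD directly would give the same pairs; the intersection step is kept here to mirror the description above.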

The code used in this video is:

The jupyter notebook used in this video is:

We highly recommend that you practise by applying more RDD operations to this data.
