You have now seen the various operations on basic as well as paired RDDs, what each one does, what it returns and when to use it. Now, let's apply these operations to solve the problem statement: counting how many times each word appears in a particular document.
You can use any text document: place it in the root user's directory, copy it into HDFS and then load it into Spark. If you want to use the same document as our SME, you can download it here.
Now, let's watch how Kautak and Vishwa solve this word count problem using PySpark.
Note:
Please note that in this module, you may sometimes see the kernel mentioned as Python 2 instead of PySpark. This is because some of these videos are older, and the Python 2 kernel already had the PySpark libraries installed. With the current EMR configuration, you will need to use the PySpark kernel only. The SME might also say EC2 instance where EMR instance is meant in our case (at the most basic level, an EMR instance is an EC2 instance with additional configuration).
As discussed in the previous video, there are two ways to solve this problem.
Using paired RDDs and applying the reduceByKey() method:
- Create a simple RDD from the text file.
- Remove all empty lines using the filter() method.
- Use flatMap() to make each word one element of the RDD.
- Create a paired RDD from this basic RDD using map(lambda x: (x, 1)).
- Now, use the reduceByKey() method to add up all the values for a particular key.
Using basic RDDs and the countByValue() method:
- Create a simple RDD from the text file.
- Remove all empty lines using the filter() method.
- Use flatMap() to make each word one element of the RDD.
- Now, use the countByValue() method on this RDD; it returns a dictionary of key/value pairs in which each key is a word in the document and each value is the number of times that word appears.
The code used in this video for the first method is:
One more thing to note here: if we use words.countByValue().items(), the output is a list of (word, count) pairs rather than a dictionary.
The jupyter notebook used in this video is: