In this segment, we will use Spark DataFrames to query an Amazon review dataset that contains product reviews written by customers on Amazon’s platform.
The dataset is hosted by upGrad in an S3 bucket, and you will need to copy it to an S3 bucket in your own AWS account. To copy the data from the upGrad bucket, log in to your instance and then use the following command.
Once the transfer is complete, you can check if the file has been transferred using the following command:
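As a rough sketch of the copy-and-verify step described above, the snippet below wraps the AWS CLI's `aws s3 cp` and `aws s3 ls` commands from Python. The bucket names and object key are hypothetical placeholders, not the actual course locations, and the sketch assumes the AWS CLI is installed and configured on the instance.

```python
import subprocess

# Hypothetical placeholders: replace with the real source and destination URIs.
SOURCE_URI = "s3://source-bucket/path/to/dataset"
DEST_URI = "s3://your-bucket/path/to/dataset"


def copy_command(source_uri, dest_uri):
    """Build the AWS CLI command that copies an object between S3 locations."""
    return ["aws", "s3", "cp", source_uri, dest_uri]


def list_command(uri):
    """Build the AWS CLI command that lists what is stored at an S3 location."""
    return ["aws", "s3", "ls", uri]


def copy_and_verify(source_uri, dest_uri):
    """Copy the dataset, then list the destination to confirm it arrived."""
    # check=True raises CalledProcessError if the CLI exits with an error.
    subprocess.run(copy_command(source_uri, dest_uri), check=True)
    subprocess.run(list_command(dest_uri), check=True)
```

Calling `copy_and_verify(SOURCE_URI, DEST_URI)` on the instance would run both commands in order; the same two commands can also be typed directly into the terminal.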
Please modify the dataset path in the code so that it points to the copy in your own S3 bucket.
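To illustrate what changing the path might look like, here is a minimal sketch that builds the S3 URI and loads the data into a DataFrame. The bucket name, object key, and the JSON file format are assumptions made for illustration, and the helper names `dataset_path` and `load_reviews` are hypothetical. Whether the URI scheme should be `s3a` or `s3` depends on how Spark is set up on your instance.

```python
def dataset_path(bucket, key, scheme="s3a"):
    """Return the URI Spark should read from, e.g. s3a://bucket/key."""
    return f"{scheme}://{bucket}/{key.lstrip('/')}"


def load_reviews(spark, path):
    """Read the review data into a DataFrame (JSON format is an assumption)."""
    return spark.read.json(path)


def main():
    # Imported here so dataset_path can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("amazon-reviews").getOrCreate()
    # Hypothetical bucket and key: replace with your own location.
    df = load_reviews(spark, dataset_path("your-bucket", "path/to/dataset"))
    df.printSchema()  # inspect the inferred column names and types
    df.show(5)        # preview the first few reviews
```

In a notebook, only the arguments to `dataset_path` would need to change for your own bucket.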
Let’s now analyse the dataset using the DataFrame API in the next video.
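Through this case study, we learnt that analysing a dataset using DataFrames is generally more convenient than using RDDs: the DataFrame API works with named columns and lets Spark optimise the query before running it.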
The Jupyter Notebook used in the video is: