In the previous segment, you explored the features of the given data set. The next step of the model-building process is to perform EDA. In the upcoming video, Ajay will walk you through the EDA for this data set.
As explained by Ajay in the video above, you can use the following command to read the data:
The path to your S3 may be different from the one mentioned above. Remember to add the correct path in your code.
Now, after loading the data, you only require the ‘artist_name’ and ‘plays’ columns in order to cluster artists based on their popularity. You can select the required columns using the following command:
As explained in the video, an artist’s songs can be played by many different users. So, to find the total number of times an artist’s songs were played, you need to sum the play counts across all users for each artist. For this, you can use the following command:
Note that the new DataFrame formed in the step above contains fewer rows than the previous DataFrame, because the play counts are aggregated into one row per artist. After computing these totals, you need to use a vector assembler to combine the input columns into a single vector column, because ML models in PySpark expect their input features as vectors. You can use the following commands to implement a vector assembler:
Now, you can use ‘model_data’ as the final DataFrame for model building. In the next segment, you will learn about the model building process.