In the previous session, you learnt about the basic data-handling and data-cleaning tasks that need to be performed before any analysis. In this session, you will begin working with the Seaborn library and start extracting insights. Recall that the target variable for this case study is the ‘Rating’ column. The main task is to analyse this column and compare it with other variables to observe how the ratings change across different categories.
First, you’ll learn how to build a distribution plot for the ‘Rating’ column, which is quite similar to the histograms that you saw earlier in Matplotlib.
So, you have plotted the distribution of ratings using both the Matplotlib function and the Seaborn function. In the latter case, you must have noticed that instead of the hist command, you are now using distplot, short for distribution plot. The corresponding Seaborn command is sns.distplot(inp1.Rating).
You can go through distplot’s documentation here to learn more about the various parameters that can be used. Notice that this view is quite different from the histogram plot that we had obtained earlier in Matplotlib.
The difference arises because, instead of plotting the ‘frequency’, the distplot in Seaborn computes the probability density for each rating bucket, so that the total area of the bars equals 1. The curve drawn over the bars (the KDE, or kernel density estimate, as noted in the Seaborn documentation) is the approximate probability density curve.*
Coming back to the visualisation, the bars that get plotted in both cases are proportional to each other. For example, the maximum frequency occurs around the 4-4.5 bucket in the histogram plotted by Matplotlib. Similarly, the maximum density also lies in the 4-4.5 bucket in the distplot.
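The relationship between frequency and density can be worked through by hand. The following is a minimal sketch in plain Python, assuming equal-width buckets; the ten ratings are invented for illustration and are not taken from the case-study data.

```python
# Hypothetical ratings, invented purely for illustration.
ratings = [3.2, 3.8, 4.0, 4.1, 4.1, 4.3, 4.4, 4.4, 4.6, 4.7]

low = 3.0        # left edge of the first bucket
bin_width = 0.5  # buckets: 3-3.5, 3.5-4, 4-4.5, 4.5-5
n_bins = 4

# Frequency view: count how many ratings fall into each bucket.
counts = [0] * n_bins
for r in ratings:
    idx = min(int((r - low) / bin_width), n_bins - 1)
    counts[idx] += 1

# Density view: divide each count by (number of points * bucket width),
# so that the bar areas add up to 1.
n = len(ratings)
density = [c / (n * bin_width) for c in counts]
total_area = sum(d * bin_width for d in density)

# The tallest bar is the same bucket in both views.
peak_by_count = counts.index(max(counts))
peak_by_density = density.index(max(density))

print(counts)      # frequencies per bucket
print(density)     # densities per bucket
print(total_area)  # approximately 1.0
```

Because every count is divided by the same constant, the ratio between any two bars is unchanged, which is why the 4-4.5 bucket is the tallest in both views.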
The advantage of the distplot view is that it adds a layer of probability distribution without any additional inputs and preserves the same inter-bin relationship as in the Matplotlib version. This statistical view is also more informative and more visually appealing than the earlier one.
You are expected to go through the Seaborn documentation from the link given above and answer the following questions.
In the next video, you will learn about the various customisations that can be performed on a Seaborn distplot.
So, after changing the number of bins to 20, you were able to observe that most of the ratings lie in the 4-5 range. This is quite a useful insight, which highlights the peculiarities of this domain, as mentioned by Rahim. If people dislike an app, they don’t generally wait to give it bad ratings; rather, they go ahead and remove it immediately. Therefore, the average ratings of the apps are pretty high.
Also, you learnt about some more customisations that can be done on the same view. You can change the colour of the view and even use Matplotlib functionalities on top of Seaborn to make your graphs more informative.
Additional Notes
- *The terms “Probability Density” and “Probability Density Curve” may seem a bit alien to you right now if you do not have the necessary statistical background. But don’t worry, you will learn about them in a future module on Inferential Statistics. However, if you’re still curious, you can take a look at this link for further understanding.
- Another chart analogous to the histogram is the countplot. It plots the frequency of each value of a categorical variable, so the bar heights are the same numbers that you get by calling value_counts() on that variable. Take a look at its documentation to understand how it is implemented.
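The link between countplot and value_counts() can be checked with a small sketch. It assumes a hypothetical ‘Category’ column with made-up values; the column name and data are for illustration only and are not taken from the case study.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical categorical column, invented for illustration.
df = pd.DataFrame(
    {"Category": ["Everyone", "Teen", "Everyone", "Mature", "Everyone", "Teen"]}
)

# Frequencies per category, as computed by pandas.
freq = df["Category"].value_counts()
print(freq)

# countplot draws one bar per category, with these frequencies as heights.
fig, ax = plt.subplots()
sns.countplot(x="Category", data=df, ax=ax)
```

Note that countplot orders the bars by the categories' order of appearance in the data, whereas value_counts() sorts by frequency, so the two may list the categories in a different order.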
Now that you are reasonably proficient in creating a distplot and performing some basic customisations, in the next segment, let’s dive even deeper into the different ways in which you can customise a plot.