
Scatter Plots

Previously, you dealt with only a single numeric column and therefore used either a box plot or a histogram to portray the insights visually. What about two numeric columns, say Rating and Size? To plot the relationship between two numeric variables, you use a scatter plot. In the following video, Rahim will explain the situations in which a scatter plot can be used.

Scatter plots are among the most commonly used and most powerful visualisations in the field of machine learning. They are crucial for revealing relationships between data points, and you can generally deduce trends in the data with their help.

The “Sales and Discount” example that you saw at the beginning of the module is an example of a scatter plot (technically, these are 4 different scatter plots, each showing a different city).

Applications of scatter plots in machine learning

Even though you’ll learn about them in greater detail in future modules, it is good to know certain use cases where a scatter plot is especially useful in machine learning:

  • Observing trends between numeric variables: Because scatter plots can reveal patterns in the data, they are a necessity in linear regression problems, where you want to determine whether a linear model, i.e. using a straight line to predict something, makes sense. Check out the diagram given below.
  • Making a linear model between x and y makes complete sense in the first case, but not in the second.
  • Observing natural clusters in the data: In simple terms, clustering is the act of grouping similar entities into clusters. For example, say you have a group of students who have recently taken tests in Maths and Biology. A scatter plot of their marks in the two subjects reveals the following view.

You can clearly group the students into 4 clusters now. Cluster 1 contains students who score very well in Biology but very poorly in Maths, Cluster 2 contains students who score equally well in both subjects, and so on.
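To see how such clusters show up on a plot, here is a minimal sketch using made-up marks. The four cluster centres and the spread are assumptions chosen for illustration (only the first two groups are described above), not real student data.

```python
import random

import matplotlib.pyplot as plt

random.seed(0)  # fixed seed so the made-up marks are reproducible

# Hypothetical cluster centres as (Maths mark, Biology mark)
centres = {
    "Cluster 1: low Maths, high Biology": (25, 85),
    "Cluster 2: high in both": (85, 85),
    "Cluster 3: high Maths, low Biology": (85, 25),
    "Cluster 4: low in both": (25, 25),
}

fig, ax = plt.subplots()
for label, (maths_centre, biology_centre) in centres.items():
    # 30 synthetic students scattered around each centre
    maths = [random.gauss(maths_centre, 5) for _ in range(30)]
    biology = [random.gauss(biology_centre, 5) for _ in range(30)]
    ax.scatter(maths, biology, label=label, alpha=0.7)

ax.set_xlabel("Maths marks")
ax.set_ylabel("Biology marks")
ax.legend()
# In a script, call plt.show() to display the figure.
```

Each call to scatter draws one group in its own colour, so the four groups are visually separated in the same way the clusters are described above.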

Now, coming back to our problem: plotting the scatter plot between Rating and Size. You already know how to do this in matplotlib using the pyplot.scatter() function. In seaborn, there is sns.scatterplot(), which is intuitive and similar to its matplotlib counterpart. You are advised to go through its official documentation to understand how the various parameters work.
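As a quick side-by-side, the sketch below draws the same scatter plot with both libraries. The DataFrame name apps and its values are hypothetical stand-ins for the course dataset; only the column names Size and Rating come from the text. The matplotlib half uses the Axes method scatter(), which takes the same arguments as pyplot.scatter().

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the course dataset (values are made up)
apps = pd.DataFrame({
    "Size": [1200, 5400, 8700, 15000, 21000, 33000, 48000, 60000],
    "Rating": [4.1, 3.8, 4.4, 4.0, 4.5, 3.9, 4.3, 4.6],
})

fig, (ax_mpl, ax_sns) = plt.subplots(1, 2, figsize=(10, 4))

# matplotlib: pass the two columns directly and label the axes yourself
ax_mpl.scatter(apps["Size"], apps["Rating"])
ax_mpl.set_xlabel("Size")
ax_mpl.set_ylabel("Rating")
ax_mpl.set_title("matplotlib scatter")

# seaborn: name the columns and pass the DataFrame through `data`;
# the axis labels are filled in from the column names
sns.scatterplot(data=apps, x="Size", y="Rating", ax=ax_sns)
ax_sns.set_title("sns.scatterplot")
# In a script, call plt.show() to display the figure.
```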

However, in this case, you’ll be using a joint plot, which combines a scatter plot with additional statistical information. Let’s watch the next video to understand this further.

You used seaborn’s jointplot() function to plot it and observed the following results:

Important Note

In newer versions of seaborn, the Pearson r and p value metrics may not be visible, since the feature that displayed them has been deprecated. We suggest the following workarounds. Please use the one that matches your seaborn version.

Method 1 (before seaborn 0.11)

You have to annotate those values manually by importing the scipy.stats library and passing an additional parameter called stat_func in the jointplot call. Please check the code below to get a better understanding.

Method 2 (before seaborn 0.11)

Here’s a Stack Overflow answer that describes another, similar way to achieve the same thing.

Method 3 (seaborn 0.11 and above)

For seaborn versions 0.11.0 and above, the above two methods won’t work. Please use the following code snippet instead.

In addition to the normal scatter plot, the jointplot also adds the histograms of the respective columns. This gives you an idea of the spread of each variable, so you can draw more precise conclusions and gather insights from the data.

[Also, notice that the “Pearson r” and “p value” statistics are available to you as well. You’ll learn more about them in an upcoming module.*]

The syntax of jointplot is pretty similar to both the scatter plot syntaxes from seaborn and matplotlib. Take a look at the official documentation to learn more about the parameters. 

The major insight that you got from the scatter plot is that there is only a very weak trend between size and ratings, i.e. you cannot confidently say that a larger size means better ratings.

Regplots

When you were introduced to the seaborn library, it was mentioned that seaborn provides automatic estimation and plotting for regression setups for different kinds of variables. Regression will be dealt with in detail in future modules. For now, however, it’s good to know how seaborn uses a modified version of scatter plots, known as regplots, to achieve this. Let’s hear from Rahim as he explains this feature.
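As a rough illustration of the idea, the sketch below calls sns.regplot() on a hypothetical DataFrame named apps (made-up values; only the column names Size and Rating come from the text). It may differ from the exact call used in the video.

```python
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the course dataset (values are made up)
apps = pd.DataFrame({
    "Size": [1200, 5400, 8700, 15000, 21000, 33000, 48000, 60000],
    "Rating": [4.1, 3.8, 4.4, 4.0, 4.5, 3.9, 4.3, 4.6],
})

# regplot draws the scatter points, fits a straight line through them,
# and shades a confidence band around that line
ax = sns.regplot(data=apps, x="Size", y="Rating")
# In a script, call matplotlib.pyplot.show() to display the figure.
```

You can get the same fitted line inside a joint plot by passing kind="reg" to sns.jointplot().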

In the next segment, let’s learn to plot multiple charts together.

Additional Notes

  • * In case you’re curious, Pearson’s r is a metric that measures the linear correlation between two numeric variables. You can read more about it in the following link.
  • Scatter plots can show the trends for only 2 numeric variables. For understanding the relationships between 3 or more, you need to use other visualisations.
  • Here’s an article discussing the utilities of scatter plots.
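For the curious, the Pearson’s r mentioned in the first note can be computed by hand. The sketch below follows the standard formula, the covariance of the two variables divided by the product of their standard deviations, using only the Python standard library. The function name pearson_r is our own choice for illustration.

```python
import math


def pearson_r(xs, ys):
    """Return Pearson's correlation coefficient for two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2 values)")

    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)

    # Sum of products of deviations (covariance, up to a constant factor)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (variances, up to the same constant factor)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)

    if var_x == 0 or var_y == 0:
        raise ValueError("r is undefined when one variable is constant")

    # The constant factors cancel, leaving a value between -1 and 1
    return cov / math.sqrt(var_x * var_y)


perfect_up = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])    # points on a rising line
perfect_down = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])  # points on a falling line
```

A value close to 1 or -1 means the points lie close to a straight line, while a value near 0 means there is little linear trend, as with Size and Rating.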
