
The why of PCA

Before learning anything new, it helps to understand why that knowledge is useful and how it is applied. So let’s start with the motivation for studying PCA, and then look at a brief overview of the technique and its applications.

Note:

At 3:05, for 100 variables we will need 4950 plots (one per pair of variables, i.e., 100 × 99 / 2) to visualise the associations, not 450.
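
The corrected figure is simply the number of distinct pairs among 100 variables, since each 2-D scatter plot shows one pair. As a quick check, here is a minimal Python sketch; the function name pairwise_plot_count is hypothetical and used only for illustration.

```python
from math import comb


def pairwise_plot_count(n_variables: int) -> int:
    """Number of distinct variable pairs, i.e. one 2-D scatter plot per pair."""
    # Choosing 2 variables out of n gives n * (n - 1) / 2 pairs.
    return comb(n_variables, 2)


print(pairwise_plot_count(100))  # 4950
```

The count grows quadratically with the number of variables, which is why plotting every pair quickly becomes impractical.
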
As explained by Rahim, a couple of situations where having a lot of features posed problems for us are as follows:

  • The predictive model setup: Having a lot of correlated features leads to the multicollinearity problem. Iteratively removing features is time-consuming and also leads to some information loss.
  • Data visualisation: It is not possible to visualise more than two variables at the same time using any 2-D plot. Therefore, finding relationships between the observations in a data set having several variables through visualisation is quite difficult. 

PCA helps solve both of the problems mentioned above, as you’ll see shortly.
Let’s listen to the following lecture to understand the various applications of PCA.
Fundamentally, PCA is a dimensionality reduction technique, i.e., it approximates the original data set with a smaller one containing fewer dimensions. (Note that ‘dimension’ is just another term for a column or variable in a data set.) To understand it visually, take a look at the following image.
In the image above, you can see that a data set having ‘N’ dimensions has been approximated by a smaller data set containing ‘k’ dimensions. In this module, you will learn how this manipulation is done. This simple manipulation helps in several ways, such as the following:

  • Data visualisation and EDA
  • Creating uncorrelated features that can be input to a prediction model: With a smaller number of uncorrelated features, the modelling process is faster and more stable.
  • Finding latent themes in the data: If you have a data set containing the ratings given to different movies by Netflix users, PCA would be able to find latent themes like genre and, consequently, the ratings that users give to a particular genre.
  • Noise reduction
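
The approximation of N dimensions by k dimensions described above can be sketched in a few lines of Python. This is a minimal illustration using NumPy's singular value decomposition, not necessarily the method used later in this module; the function pca_reduce, the toy data, and the choice k = 2 are all assumptions made for the example.

```python
import numpy as np


def pca_reduce(X: np.ndarray, k: int):
    """Project X (n_samples x N) onto its first k principal components."""
    mean = X.mean(axis=0)
    X_centred = X - mean
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X_centred, full_matrices=False)
    components = Vt[:k]                 # shape (k, N)
    scores = X_centred @ components.T   # shape (n_samples, k)
    return scores, components, mean


# Hypothetical toy data: 200 observations with N = 5 correlated columns,
# generated from 2 hidden factors plus a small amount of noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 5))

scores, components, mean = pca_reduce(X, k=2)
print(scores.shape)  # (200, 2)

# The new features are uncorrelated: the off-diagonal covariance is ~0.
cov = np.cov(scores, rowvar=False)
print(abs(cov[0, 1]) < 1e-8)

# Mapping back to N dimensions gives an approximation of the original data;
# the discarded directions carry mostly noise in this toy example.
X_approx = scores @ components + mean
rel_error = np.linalg.norm(X - X_approx) / np.linalg.norm(X - mean)
print(rel_error)
```

Because the toy data were built from two hidden factors, two components are enough to approximate all five columns closely, and the resulting features are uncorrelated with each other.
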

Now attempt the following questions to test your understanding.
