
Summary

Here’s a brief summary of what you learnt in this session!

Dimensionality reduction is a way of transforming a data set with a large number of features into a smaller one. A couple of situations where having a lot of features posed problems for us are as follows:

  • The predictive model setup: Having a lot of correlated features leads to the multicollinearity problem. Iteratively removing features is time-consuming and also results in some information loss.
  • Data visualisation: A 2-D plot can show only two variables at a time. Therefore, finding relationships between the observations in a data set with several variables through visualisation is quite difficult.

In simple terms, dimensionality reduction is the exercise of dropping the unnecessary variables, i.e., the ones that add no useful information. Now, this is something that you must have done in the previous modules. In EDA, you dropped columns that had a lot of nulls or duplicate values, and so on. In linear and logistic regression, you dropped columns based on their p-values and VIF scores in the feature elimination step.

PCA is one such dimensionality reduction technique, i.e., it approximates the original data set with a smaller one containing fewer dimensions. PCA transforms the data by creating new features from the old ones, which makes it easier to decide which features to keep and which to drop. To understand this visually, take a look at the following image.

[Image: a data set with N dimensions approximated by a smaller data set with 'k' dimensions]

In the image above, you can see that a data set with N dimensions has been approximated by a smaller data set containing ‘k’ dimensions. In this module, you will learn how this manipulation is done. This simple manipulation helps in several ways, such as the following (a short code sketch after the list illustrates the reduction itself):

  • Data visualisation and EDA
  • Creating uncorrelated features that can be fed into a prediction model: With a smaller number of uncorrelated features, the modelling process is faster and more stable as well.
  • Finding latent themes in the data: If you have a data set containing the ratings given to different movies by Netflix users, PCA would be able to find latent themes like genre and, consequently, the ratings that users give to a particular genre.
  • Noise reduction
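
Here is a minimal sketch of the N-to-k reduction, assuming Python with scikit-learn (the library choice, the toy data and all variable names are illustrative, not prescribed by this session):

```python
# Reducing an N-dimensional data set to k dimensions with PCA.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Toy data: 100 observations, N = 5 features, some deliberately correlated.
base = rng.normal(size=(100, 2))
X = np.column_stack([
    base[:, 0],
    base[:, 0] * 2 + rng.normal(scale=0.1, size=100),  # correlated with the first feature
    base[:, 1],
    base[:, 1] - base[:, 0],
    rng.normal(size=100),                               # pure noise
])

# Standardise first, since PCA is sensitive to feature scales.
X_scaled = StandardScaler().fit_transform(X)

# Keep k = 2 principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                # (100, 2): N = 5 reduced to k = 2
print(pca.explained_variance_ratio_)  # share of information each component retains
# The new features are uncorrelated: off-diagonal correlations are ~0.
print(np.corrcoef(X_reduced, rowvar=False).round(3))
```

Here, explained_variance_ratio_ shows how much of the data set’s variance each new feature retains, which is what lets you decide how many components ‘k’ are worth keeping.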

As explained in the video above, PCA is a statistical procedure that converts the observations of possibly correlated variables into ‘principal components’ such that (a short sketch after this list makes these properties concrete):

  • They are uncorrelated with each other.
  • They are linear combinations of the original variables.
  • They help in capturing the maximum information (variance) in the data set.
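
To make these three properties concrete, here is a small sketch that computes principal components directly from the eigenvectors of the covariance matrix, which is one standard way to derive them (assuming NumPy; the toy data and variable names are illustrative):

```python
# Principal components as eigenvectors of the covariance matrix.
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.8 * x1 + rng.normal(scale=0.3, size=200)  # correlated with x1
X = np.column_stack([x1, x2])
X = X - X.mean(axis=0)                           # centre the data

cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)           # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]                # largest variance first
weights = eigvecs[:, order]                      # each column = one component's weights

# Each principal component is a linear combination of x1 and x2 ...
scores = X @ weights
# ... the components are uncorrelated with each other ...
print(np.corrcoef(scores, rowvar=False).round(3))
# ... and the first component captures the maximum share of the variance.
print((eigvals[order] / eigvals.sum()).round(3))
```

Projecting the centred data onto the eigenvectors yields uncorrelated scores, each component is by construction a weighted (linear) combination of the original variables, and sorting by eigenvalue puts the component that captures the most variance first.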
