In the previous segments, you saw how to perform dimensionality reduction using PCA and were then introduced to one of its key applications: data visualisation. However, the most common application of PCA is to improve your model’s performance. In practice, you use PCA in conjunction with other models such as linear regression, logistic regression or clustering in order to make the process more efficient. In the following demonstration, you’ll look at both scenarios, building the model without PCA and then with PCA, to appreciate how much faster it is to get similar or better results in the latter case.
Overview of the Demo
For this demonstration, the main model will be logistic regression. As mentioned above, we’ll first perform logistic regression directly, without any PCA. We’ll be using the Telecom Churn dataset that you have worked with earlier.
Model Building without PCA
Since you’re already familiar with the data and the logistic regression model that you built, here’s a quick walkthrough to refresh your memory.
Video Correction: At 03:52, Rahim says ‘linear regression’ though he meant ‘logistic regression’
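For reference, here’s a minimal sketch of the “without PCA” baseline, assuming a cleaned churn DataFrame with a binary ‘churn’ column; the file name, column name and split settings are illustrative, not the exact code used in the video.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Load the cleaned churn data (file and column names are illustrative)
churn = pd.read_csv("telecom_churn.csv")
X = churn.drop("churn", axis=1)
y = churn["churn"]

# 70-30 train-test split, stratified on the target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=42, stratify=y
)

# Scale the features so that all of them are on a comparable scale
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Fit logistic regression directly on all features (no PCA yet)
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train_scaled, y_train)

print("Test ROC AUC without PCA:",
      roc_auc_score(y_test, logreg.predict_proba(X_test_scaled)[:, 1]))
```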
In the video below, we will apply PCA to the data and visualise the transformations performed by PCA.
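The sketch below shows one way to inspect those transformations, reusing the scaled training data from the previous snippet; the scree plot and the two-component scatter are illustrative choices, not necessarily the exact plots shown in the video.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Fit PCA on the scaled training data from the previous snippet
pca = PCA(random_state=42)
X_train_pca = pca.fit_transform(X_train_scaled)

# Scree plot: cumulative variance explained by the principal components
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel("Number of principal components")
plt.ylabel("Cumulative explained variance")
plt.show()

# Scatter of the first two principal components, coloured by the churn label
# (assumes the label is encoded as 0/1)
plt.scatter(X_train_pca[:, 0], X_train_pca[:, 1], c=y_train, cmap="coolwarm", s=5)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
```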
You saw the process of building a churn prediction model using logistic regression. Some important problems with this process that Rahim pointed out are:
- Multicollinearity among a large number of variables, which is not totally avoided even after reducing variables using RFE (or a similar technique)
- The need for a lengthy iterative procedure, i.e. identifying collinear variables, using variable selection techniques, dropping insignificant variables, and so on
- A potential loss of information due to dropping variables
- Model instability due to multicollinearity
If you recall, we discussed all these points in the first session as potential issues that plague our model-building activity. Now let’s perform PCA on the dataset, apply logistic regression on the reduced data and see if we get better results.
Model Building with PCA
In the second part, we’ll first reduce the dimensionality of the data using PCA and then build a logistic regression model on the reduced dataset.
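A minimal sketch of this approach is shown below, again reusing the scaled train and test sets from the earlier snippet; the 95% variance threshold is an illustrative choice, not the exact cutoff used in the video.

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Keep enough components to explain ~95% of the variance (illustrative threshold)
pca = PCA(n_components=0.95)
X_train_red = pca.fit_transform(X_train_scaled)   # scaled data from the earlier snippet
X_test_red = pca.transform(X_test_scaled)

# Logistic regression on the reduced feature space
logreg_pca = LogisticRegression(max_iter=1000)
logreg_pca.fit(X_train_red, y_train)

print("Components retained:", pca.n_components_)
print("Test ROC AUC with PCA:",
      roc_auc_score(y_test, logreg_pca.predict_proba(X_test_red)[:, 1]))
```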
As you saw, with PCA you could achieve the same results with just a couple of lines of code. It is worth noting that the baseline PCA model performed on par with the best logistic regression model built after feature elimination and the other steps.
PCA helped us solve the problems of multicollinearity (and thus model instability) and loss of information due to dropped variables, and it removed the need for iterative feature selection procedures. Our model also runs much faster because it works on a smaller dataset. Even then, our ROC score, which is a key model performance metric, is similar to what we achieved previously.
To sum it up, if you’re doing any sort of modelling activity on a large dataset containing lots of variables, it is good practice to perform PCA on that dataset first, reduce the dimensionality and then go ahead and create the model that you wanted to build in the first place. You are advised to perform PCA on the datasets that you worked on in Linear Regression and Clustering as well, to see how it makes your job easier; a short sketch of this pattern follows.
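One convenient way to reuse PCA as a generic preprocessing step is a scikit-learn Pipeline. The sketch below chains scaling, PCA and a downstream model, here KMeans as an example for the clustering exercise; the number of clusters and the variance threshold are illustrative assumptions.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Scale -> PCA -> KMeans, chained so the same preprocessing is applied consistently
clustering_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.95)),
    ("kmeans", KMeans(n_clusters=4, n_init=10, random_state=42)),
])

# X is the feature matrix of whichever dataset you are revisiting
cluster_labels = clustering_pipeline.fit_predict(X)
```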