
Scree Plots

In the previous segment, you learnt how to perform PCA on your dataset and obtain the Principal Components. The final PCs that you got were as follows:

array([[ 0.52237162, -0.26335492,  0.58125401,  0.56561105],
       [ 0.37231836,  0.92555649,  0.02109478,  0.06541577],
       [-0.72101681,  0.24203288,  0.14089226,  0.6338014 ],
       [-0.26199559,  0.12413481,  0.80115427, -0.52354627]])

PC1 is given by the direction [0.52  -0.26  0.58  0.56], PC2 by [0.37  0.92  0.02  0.06], and so on, with each row of the array being one Principal Component. There are as many Principal Components as there are original variables, and each Principal Component explains some share of the total variance in the dataset. Knowing these shares tells us which Principal Components to keep and which to discard when performing Dimensionality Reduction.
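As a quick sanity check on the array above, the following sketch treats each row as one Principal Component and confirms two standard properties of PCA directions: each has a length of about 1, and any two different directions are perpendicular (their dot product is about 0). The names `components` and `dot` are illustrative only; the values are copied from the output shown earlier.

```python
import math

# Rows copied from the array printed above; each row is one principal component.
components = [
    [ 0.52237162, -0.26335492,  0.58125401,  0.56561105],  # PC1
    [ 0.37231836,  0.92555649,  0.02109478,  0.06541577],  # PC2
    [-0.72101681,  0.24203288,  0.14089226,  0.6338014 ],  # PC3
    [-0.26199559,  0.12413481,  0.80115427, -0.52354627],  # PC4
]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Length of each direction (expected to be close to 1).
lengths = [math.sqrt(dot(pc, pc)) for pc in components]

# Dot products between every pair of different directions (expected to be close to 0).
cross = [
    dot(components[i], components[j])
    for i in range(len(components))
    for j in range(i + 1, len(components))
]

print("lengths:", [round(x, 4) for x in lengths])
print("largest |dot| between different PCs:", round(max(abs(c) for c in cross), 6))
```

Small deviations from exactly 1 and 0 are expected here, since the printed values are rounded to eight decimal places.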

Let’s understand it further in the following demonstration, where you’ll also come to know about scree plots and how they help in communicating the variance information very effectively.

Here’s a summary of the important steps that you performed:

  • First, you came to know how much variance is being explained by each Principal Component using the following code:
pca.explained_variance_ratio_
  • The values that you got were as follows:
array([0.72770452, 0.23030523, 0.03683832, 0.00515193])
  • The above values can be summarised in the following table:
Principal Component    Variance explained (in %)
PC1                    72.8
PC2                    23
PC3                    3.6
PC4                    0.5
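The conversion from ratios to percentages can be sketched in a few lines. This is a minimal illustration rather than the code used in the demonstration: the list is copied from the `pca.explained_variance_ratio_` output above, and the names `explained_variance_ratio` and `percent_by_pc` are made up for this example. Note that formatting to one decimal place rounds PC3 to 3.7, whereas the table shows 3.6, which looks like the result of truncating instead of rounding.

```python
# Values copied from the pca.explained_variance_ratio_ output above.
explained_variance_ratio = [0.72770452, 0.23030523, 0.03683832, 0.00515193]

# Express each ratio as a percentage, labelled PC1, PC2, ...
percent_by_pc = {
    f"PC{i}": ratio * 100
    for i, ratio in enumerate(explained_variance_ratio, start=1)
}

for name, pct in percent_by_pc.items():
    print(f"{name}: {pct:.1f}%")

# The ratios are fractions of the total variance, so they add up to about 1.
print("total:", round(sum(explained_variance_ratio), 6))
```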

So, as you can see, the first PC, i.e. Principal Component 1 ([0.52  -0.26  0.58  0.56]), explains the most variance in the dataset at 72.8%, followed by PC2 at 23% and PC3 at 3.6%. In general, when you perform PCA, the Principal Components are formed in decreasing order of the variance that they explain. Therefore, the first principal component will always explain the highest variance, followed by the second principal component and so on. This order helps us in our dimensionality reduction exercise, as we now know which directions are more important than the others.

Now, in our dataset, we only had 4 columns and therefore 4 PCs. It was easy to visualise the amount of variance explained by each of them using a simple bar plot and then make a call as to how much variance to keep in the data. For example, using the table above, you only need 2 principal components or 2 directions (PC1 and PC2) to explain more than 95% of the variation in the data.

But what happens when there are hundreds of columns? Using the above process would be cumbersome since you’d need to look at all the PCs and keep adding their variances up to find the total variance captured.

  • Using a Scree-Plot

An elegant solution here is to plot the cumulative variance explained. Against each number of components, this chart shows the total variance explained by all the components up to and including that one.

Principal Component    Variance explained (in %)    Cumulative variance explained (in %)
PC1                    72.8                         72.8
PC2                    23                           95.8
PC3                    3.6                          99.4
PC4                    0.5                          99.9

So, for example, the cumulative variance explained by the top 2 principal components is the sum of their individual variances, given by 72.8 + 23 = 95.8%. Similarly, you can continue this for 3 components (95.8 + 3.6 = 99.4%) and for 4 components.
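The running totals above can be computed programmatically, which is what makes this approach practical when there are hundreds of components. The sketch below is one possible way to do it using only the standard library; `components_needed` is a hypothetical helper written for this example, not part of any library, and the 95% target is just the figure used earlier in this segment.

```python
from itertools import accumulate

# Values copied from the pca.explained_variance_ratio_ output shown earlier.
explained_variance_ratio = [0.72770452, 0.23030523, 0.03683832, 0.00515193]

# Running total: the k-th entry is the variance explained by the first k components.
cumulative = list(accumulate(explained_variance_ratio))

def components_needed(cumulative_ratios, target):
    """Return the smallest number of leading components whose cumulative
    explained variance reaches `target` (a fraction between 0 and 1)."""
    for k, total in enumerate(cumulative_ratios, start=1):
        if total >= target:
            return k
    raise ValueError("target is not reached even with all components")

for k, total in enumerate(cumulative, start=1):
    print(f"first {k} PC(s): {total * 100:.1f}%")

print("components needed for 95%:", components_needed(cumulative, 0.95))
```

Because these totals come from the unrounded ratios, the last digit may differ slightly from the table. scikit-learn's `PCA` also accepts a fraction for `n_components` (for example `PCA(n_components=0.95)`), which selects the number of components by explained variance in a similar way; it is worth checking the documentation for the exact behaviour with your chosen solver.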

If you plot the number of components on the X-axis and the total variance explained on the Y-axis, the resultant plot is also known as a Scree-Plot. It would look somewhat like this:

Now, this is a better representation of variance and the number of components needed to explain that much variance. 
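A plot of this kind could be produced along the following lines. This is a minimal sketch that assumes matplotlib is available; the function names `scree_points` and `plot_scree`, the dashed reference line at 95%, and the axis labels are choices made for this example and are not taken from the demonstration.

```python
def scree_points(explained_variance_ratio):
    """Return (x, y): number of components and cumulative variance in %."""
    x, y, running = [], [], 0.0
    for k, ratio in enumerate(explained_variance_ratio, start=1):
        running += ratio
        x.append(k)
        y.append(running * 100)
    return x, y

def plot_scree(explained_variance_ratio, target_percent=95):
    """Draw cumulative variance explained against the number of components."""
    # Imported here so that scree_points() works even without matplotlib installed.
    import matplotlib.pyplot as plt

    x, y = scree_points(explained_variance_ratio)
    fig, ax = plt.subplots()
    ax.plot(x, y, marker="o")
    ax.axhline(target_percent, linestyle="--", color="grey")  # reference line
    ax.set_xticks(x)
    ax.set_xlabel("Number of components")
    ax.set_ylabel("Cumulative variance explained (%)")
    ax.set_title("Scree plot")
    return fig

# Values copied from the pca.explained_variance_ratio_ output shown earlier.
ratios = [0.72770452, 0.23030523, 0.03683832, 0.00515193]
x, y = scree_points(ratios)
print(list(zip(x, [round(v, 1) for v in y])))
```

To display the chart, call `plot_scree(ratios)` and then `matplotlib.pyplot.show()`, or save the returned figure with `fig.savefig(...)`. The number of components you need is where the curve first crosses the reference line.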
