In the previous two segments, you learnt how to apply PCA to a dataset and how to read a scree plot. Now that you know how many principal components you need to explain a given amount of variance, let’s finally perform dimensionality reduction on our dataset using the Principal Components that we’ve chosen.
Here’s a summary of the important steps that you saw above:
- Choosing the required number of components
- From the scree plot that you saw previously, you decided to keep ~95% of the information (variance) in the data, and for that you need only 2 components. Hence, you instantiate a new PCA object with the number of components set to 2. Once fitted, this object performs the dimensionality reduction on our dataset and reduces the number of columns from 4 to 2.
pc2 = PCA(n_components=2, random_state=42)
- Perform Dimensionality Reduction on our dataset.
- Now you simply transform the original dataset into a new one whose columns are the Principal Components. This is the step that actually performs the dimensionality reduction: the number of columns drops from 4 to 2 while about 95% of the information is retained. The code that you used for this step is as follows:
newdata = pc2.fit_transform(x)
and the new dataset is given as follows:
- Data Visualisation using the PCs
- Now that you have the data in 2 dimensions, it is easier to visualise it using a scatterplot or some other chart. Plotting the observations and grouping them by the species that they belong to gives the following chart:
As you can see, the species are well segregated from each other and there is little overlap between them. Such an insight is hard to get from the original data, since you cannot plot 4 dimensions on a 2-D surface. Applying PCA to our data is therefore quite useful for observing the relationship between the data points.
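The three steps above can be put together in one short sketch. It rests on assumptions that the text does not state: that the 4-column dataset is scikit-learn's built-in Iris data, that `x` was standardised before PCA, and that the chart is a Matplotlib scatterplot. Only `pc2`, `x`, and `newdata` come from the code shown earlier; every other name is illustrative.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: the 4-column dataset is Iris, standardised before PCA.
iris = load_iris()
x = StandardScaler().fit_transform(iris.data)

# Step 1: choose the number of components (2, as read off the scree plot).
pc2 = PCA(n_components=2, random_state=42)

# Step 2: fit the PCA and project the 4 columns onto the 2 components.
newdata = pc2.fit_transform(x)
retained = pc2.explained_variance_ratio_.sum()
print(f"Shape after PCA: {newdata.shape}")
print(f"Variance retained: {retained:.1%}")

# Step 3: scatterplot of PC1 against PC2, one colour per species.
fig, ax = plt.subplots()
for label, name in enumerate(iris.target_names):
    mask = iris.target == label
    ax.scatter(newdata[mask, 0], newdata[mask, 1], label=name)
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.legend(title="Species")
```

In a script, call `plt.show()` afterwards to display the figure; in a notebook, the chart is rendered when the cell finishes.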
Important Note: In general, when you perform PCA on a dataset, you may need more than 2 components to explain an adequate amount of variance in the data. In those cases, if you want to visualise the relationship between the observations, choose the top 2 Principal Components as the X and Y axes of a scatterplot or a similar plot. Since PC1 and PC2 explain more variance than any other pair of components, they give you the best available 2-D representation of the data, although whatever variance the remaining components carry will not be visible in the plot.
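As a rough illustration of this note, the sketch below uses a different dataset, scikit-learn's built-in Wine data with 13 columns, which is not part of this segment and is chosen here only as an example. Passing a fraction as `n_components` asks `PCA` to keep as many components as are needed to reach that share of variance; the first two columns of the result are then picked out for plotting. All variable names are illustrative.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: Wine (13 columns) stands in for "a dataset that needs
# more than 2 components"; it is standardised before PCA.
wine = load_wine()
x_wine = StandardScaler().fit_transform(wine.data)

# Keep as many components as needed to explain at least 95% of the variance.
pca = PCA(n_components=0.95, random_state=42)
scores = pca.fit_transform(x_wine)

# For visualisation, use only the first two components (PC1 and PC2).
top2 = scores[:, :2]
top2_share = pca.explained_variance_ratio_[:2].sum()

print(f"Components kept: {pca.n_components_}")
print(f"Variance explained by PC1 + PC2 alone: {top2_share:.1%}")
```

The two columns of `top2` can be used as the X and Y values of a scatterplot. The printed share for PC1 and PC2 shows how much of the data's variance that plot actually represents.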