The next concept that is crucial for understanding how clustering works is the idea of centroids. If you remember your high school geometry, the centroid of a triangle is its centre point, where the three medians meet. Similarly, in clustering, centroids are the centre points of the clusters that are being formed.
Before getting to the formula, here is some intuition for why a centroid is needed. Imagine you have the following clusters of the marks of a group of students in Mathematics and Biology, and someone asks you to explain them. At a glance, you can easily interpret the 4 clusters that are being formed.
The four clusters are as follows:
- Cluster 1: Students who have scored high marks in Bio, but poor marks in Maths.
- Cluster 2: Students who have scored average marks in Bio and Maths.
- Cluster 3: Students who have scored high marks in both Bio and Maths.
- Cluster 4: Students who have scored high marks in Maths, but poor marks in Bio.
The above representation is fine and correct, but it is missing one crucial piece of information: the numerical values. For example, when you compare two clusters, say Cluster 1 and Cluster 2, can you tell by how many marks, on average, the students from Cluster 1 outperform or underperform the Cluster 2 students in a particular subject just by looking at the above visualisation? Is it by 10 marks? Or 15?
This is where the concept of centroids comes in handy. Listen to the following lecture to understand its importance and how it is calculated.
As mentioned in the video, centroids are the cluster centres of a group of observations, and they help us summarise the cluster's properties. In clustering, the centroid is the mean of all the observations that belong to a particular cluster. For example, consider the dataset that you saw here.
The centroid is calculated by computing the mean of each column (dimension) and then listing those means in the same order as the columns.
- Height mean = (175 + 165 + 183 + 172) / 4 = 173.75.
- Weight mean = (83 + 74 + 98 + 80) / 4 = 83.75.
- Age mean = (22 + 25 + 24 + 24) / 4 = 23.75.
- Thus, the centroid of the above group of observations is (173.75, 83.75, 23.75).
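The calculation above can also be sketched in a few lines of Python. This is only a minimal illustration: the function name `centroid` and the variable names are chosen here for convenience, and the way the values are grouped into rows is an assumption (the heights, weights and ages are paired in the order they appear in the sums above). The column means, and therefore the centroid, do not depend on that pairing.

```python
from statistics import mean

# Each tuple is one observation: (height, weight, age).
# Assumption: values are paired row by row in the order they
# appear in the sums above; the pairing does not affect the means.
observations = [
    (175, 83, 22),
    (165, 74, 25),
    (183, 98, 24),
    (172, 80, 24),
]


def centroid(points):
    """Return the mean of each column, in the same column order."""
    # zip(*points) regroups the rows into columns:
    # all heights, then all weights, then all ages.
    return tuple(mean(column) for column in zip(*points))


cluster_centroid = centroid(observations)
print(cluster_centroid)  # (173.75, 83.75, 23.75)
```

The same function works for any number of dimensions, as long as every observation has the same number of values.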
Now that you’ve understood how the centroids are calculated, answer the following question.