Let’s understand some of the factors that can impact the final clusters that you obtain from the K-means algorithm. This would also give you an idea about the issues that you must keep in mind before you start to make clusters to solve your business problem.
Thus, the major practical considerations involved in K-Means clustering are:
- The number of clusters that you want to divide your data points into, i.e. the value of K has to be pre-determined.
- The choice of the initial cluster centres can have an impact on the final cluster formation.
- The clustering process is very sensitive to the presence of outliers in the data.
- Since the distance metric used in the clustering process is the Euclidean distance, you need to bring all your attributes on the same scale. This can be achieved through standardisation.
- The K-Means algorithm does not work with categorical data.
- The process may not converge in the given number of iterations. You should always check for convergence.
You will understand some of these issues in detail and also see the ways to deal with them when you implement the K-means algorithm in python.
Now let’s look in detail how to choose K for K-Means algorithm.
Having understood about the approach of choosing K for K-Means algorithm, we will now look at silhouette analysis or silhouette coefficient. Silhouette coefficient is a measure of how similar a data point is to its own cluster (cohesion) compared to other clusters (separation).
Let’s look at it in detail in the next lecture.
So to compute silhouette metric, we need to compute two measures i.e. a(i) and b(i) where,
- a(i) is the average distance from its own cluster(Cohesion).
- b(i) is the average distance from the nearest neighbour cluster(Separation).
Now, let’s look at how to combine cohesion and separation to compute the silhouette metric.
Additional reading
You can read more about K-Mode clustering here, We will be covering it in detail in the next section.