
Quick Recap

Before proceeding to the implementation of K-Means clustering using PySpark, let’s quickly revise the concepts involved in building a K-Means model.

In the next video, Ajay will explain the basics of the K-Means algorithm.

As you learnt in the video above, unsupervised learning identifies hidden patterns in data and draws inferences from data sets without labelled responses. Clustering is the most popular unsupervised learning method. As evident from the figure given below, data points can be clustered according to a pattern.

Clustering involves finding natural groups in the feature space of the input data. Two of the most commonly used clustering algorithms are as follows:

  • K-Means clustering
  • Hierarchical clustering

In this session, you will learn about K-Means clustering and understand its implementation. Clustering can be implemented in the following scenarios:

  • Customer segmentation
  • Marketing
  • Document clustering
  • Healthcare
  • Social media

The K-Means algorithm categorises data points into K similar clusters (groups). The steps involved in implementing the algorithm are as follows:

  1. Select K, the number of clusters you want. Let’s select K=4.
  2. Initialise K random points as the centroids of the initial clusters.
  3. Measure the Euclidean distance between each data point and each centroid, and assign each data point to its closest centroid and corresponding cluster.
  4. Recalculate the midpoint (centroid) of each cluster.
  5. Repeat steps 3 and 4 to reassign data points to clusters based on the new centroid locations.
  6. Stop the process when the centroids have stabilised, i.e., when recomputing the centroids no longer changes any data point's cluster assignment.

These six steps can be visualised using the flowchart given below.
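To make these steps concrete, here is a minimal NumPy sketch of the algorithm. This is illustrative only: the function and variable names are ours, empty clusters are not handled, and the actual PySpark implementation appears in the case study segments.

```python
import numpy as np

def k_means(X, k=4, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Steps 1-2: choose k and initialise k random data points as centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 3: Euclidean distance from every point to every centroid,
        # then assign each point to its closest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: recalculate each centroid as the mean of its cluster
        # (this sketch assumes no cluster ever becomes empty).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 6: stop once the centroids have stabilised.
        if np.allclose(new_centroids, centroids):
            break
        # Step 5: repeat steps 3 and 4 with the new centroids.
        centroids = new_centroids
    return labels, centroids

# Example usage on toy 2-D data with four visible groups.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ([0, 0], [5, 5], [0, 5], [5, 0])])
labels, centroids = k_means(X, k=4)
```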

Once the clusters are formed, the loss function is used to determine how good the clusters are. The loss function can be expressed as follows:

$$J = \sum_{i=1}^{n} \lVert X_i - \mu_{k(i)} \rVert^2$$

Where,

  • J is the loss function.
  • $X_i$ is the $i$th data point.
  • $k(i)$ denotes the cluster to which the $i$th data point is assigned, and $\mu_{k(i)}$ is the centroid of that cluster.

As evident from the equation, the loss function is the sum of squared distances between each point and its associated cluster centre. To form the best possible clusters, the loss function should be minimised.
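As a quick illustration, J can be computed directly for the output of the k_means() sketch above. This is a minimal helper under the same assumptions; the name and signature are ours.

```python
import numpy as np

def loss(X, labels, centroids):
    # J: sum of squared distances between each point and its cluster centre.
    return float(np.sum(np.linalg.norm(X - centroids[labels], axis=1) ** 2))
```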

You can use the elbow method to select the optimal value of K. A curve is plotted between K and the sum of squared distances.

As you can observe in the image above, the cost function decreases rapidly as K increases. However, beyond a certain value of K, the decrease is no longer prominent, which gives the graph an elbow-like shape; hence, it is called the ‘elbow curve’. In this example, the drop in the sum of squared distances flattens out after K = 6, so the optimal value of K is 6.
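One possible way to collect the sums of squared distances for a range of K values in PySpark is sketched below. This assumes Spark 2.4 or later, where KMeansModel.summary.trainingCost reports the training WSSSE; the toy data and column names are purely illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.getOrCreate()

# Toy two-column data; in the case study this would come from the real data set.
df = spark.createDataFrame(
    [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (8.5, 9.0), (1.2, 1.8), (8.2, 8.5)],
    ["x", "y"],
)
assembled_df = VectorAssembler(inputCols=["x", "y"], outputCol="features").transform(df)

costs = {}
for k in range(2, 6):
    model = KMeans(featuresCol="features", k=k, seed=1).fit(assembled_df)
    # trainingCost: sum of squared distances (WSSSE) on the training data.
    costs[k] = model.summary.trainingCost

# Plot k against costs[k] and look for the 'elbow' where the drop flattens out.
```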

Another important metric used to evaluate the quality of the clusters is the Silhouette Coefficient Score. This score ranges from -1 to 1, where:

  • A score close to 1 indicates that the clusters are well separated and clearly distinguished.
  • A score of 0 implies that the distance between clusters is not significant.
  • A score close to -1 means that data points have been assigned to the wrong clusters.

The Silhouette Coefficient Score can be calculated using the following formula:

$$\text{Silhouette Score} = \frac{b - a}{\max(a, b)}$$

Where,

  • a is the average intra-cluster distance, i.e., the average distance between points within a cluster.
  • b is the average inter-cluster distance, i.e., the average distance between a point and the points in its nearest neighbouring cluster.

The values of a and b can be visualised using the graph given below.
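In PySpark, the mean silhouette score can be obtained with ClusteringEvaluator, as sketched below. This reuses model and assembled_df from the elbow sketch above; note that the evaluator's default distance measure is squared Euclidean.

```python
from pyspark.ml.evaluation import ClusteringEvaluator

# Assign each point to a cluster; transform() adds a 'prediction' column.
predictions = model.transform(assembled_df)

evaluator = ClusteringEvaluator(featuresCol="features",
                                predictionCol="prediction",
                                metricName="silhouette")
print("Silhouette score:", evaluator.evaluate(predictions))
```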

Now that you are familiar with the basics of the K-Means algorithm, in the next segment, you will learn about the data set used in the K-Means case study that we will go through in the subsequent segments.

To better understand these concepts, it is advisable to revisit Module 5, Unsupervised Learning: Clustering, from the previous course, Machine Learning – 1. The session on K-Means Clustering will help form the base for the upcoming segments.
