In this session, we covered two clustering algorithms in detail: K-modes and K-Prototype.
To summarise, the K-modes clustering algorithm follows the K-means paradigm but removes the restriction to numeric data while preserving the efficiency of K-means.
The K-modes algorithm uses modes instead of means to form clusters of categorical data.
Steps of the algorithm:
- Randomly select K data points as the initial modes.
- Calculate the dissimilarity score between each of the remaining data points and each of the K chosen modes.
- Assign each data point to the mode with the lowest score.
- Update the mode of each cluster, then repeat from step 2 until no data point is reassigned or the cost function stops decreasing.
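The steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: it assumes each record is a tuple of categorical values, uses the simple matching (Hamming) count as the dissimilarity score, and the names `hamming`, `column_modes` and `k_modes` are hypothetical helpers introduced only for this example.

```python
import random
from collections import Counter

def hamming(a, b):
    """Simple matching dissimilarity: number of attributes that differ."""
    return sum(x != y for x, y in zip(a, b))

def column_modes(rows):
    """Most frequent value in each column of a non-empty list of rows."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))

def k_modes(data, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 1: pick K distinct records as the initial modes
    # (assumes the data holds at least K distinct records).
    modes = rng.sample(sorted(set(data)), k)
    labels = None
    for _ in range(max_iter):
        # Steps 2-3: score every record against every mode, keep the lowest.
        new_labels = [
            min(range(k), key=lambda c: hamming(row, modes[c])) for row in data
        ]
        # Step 4: stop once no record changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Recompute each cluster's mode; an empty cluster keeps its old mode.
        for c in range(k):
            members = [row for row, lab in zip(data, labels) if lab == c]
            if members:
                modes[c] = column_modes(members)
    return labels, modes
```

Calling `k_modes(records, k=2)` returns one cluster label per record together with the final modes. The result depends on the random initial modes, so different seeds may give different clusterings.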
For K-Prototype clustering, we combine K-means and K-modes to handle both continuous and categorical data. The K-Prototype distance function is:
$$d(x, y) = \sum_{j=1}^{p} (x_j - y_j)^2 + \gamma \sum_{j=p+1}^{M} \delta(x_j, y_j)$$
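where gamma is the weighting factor that determines the relative importance of numerical versus categorical attributes. The first p attributes are numerical, the remaining M - p are categorical, and δ is 0 when two categorical values match and 1 otherwise.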
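As a small worked illustration of this distance, the sketch below assumes each record stores its p numerical attributes first and its categorical attributes after them. The function name `k_prototype_distance` and the sample records are made up for the example.

```python
def k_prototype_distance(x, y, p, gamma=1.0):
    """Squared Euclidean distance on the first p (numerical) attributes
    plus gamma times the number of mismatched categorical attributes."""
    numeric = sum((a - b) ** 2 for a, b in zip(x[:p], y[:p]))
    categorical = sum(a != b for a, b in zip(x[p:], y[p:]))
    return numeric + gamma * categorical

# Two records with p = 2 numerical attributes and 2 categorical ones.
x = (1.0, 2.0, "red", "S")
y = (2.0, 4.0, "red", "M")

# Numerical part: (1 - 2)^2 + (2 - 4)^2 = 5; categorical part: one mismatch.
print(k_prototype_distance(x, y, p=2, gamma=0.5))  # 5 + 0.5 * 1 = 5.5
```

A larger gamma makes categorical mismatches count for more relative to numerical differences. With gamma set to 0 the categorical attributes are ignored entirely.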
Steps of the algorithm:
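- Select k initial prototypes from the data set, one for each cluster.
- Allocate each data point to the cluster whose prototype is nearest according to the dissimilarity measure.
- Retest the similarity of objects against the current prototypes. If an object is now nearer to another prototype, reallocate it and update the prototypes.
- Repeat step 3 until no object changes its cluster.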
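A rough sketch of these steps follows. It is a simplified batch variant: prototypes are recomputed once per pass rather than after every single allocation, so it should not be read as the exact procedure from the session. It assumes numerical attributes come first in each record, and `mixed_distance`, `update_prototype` and `k_prototypes` are hypothetical names.

```python
import random
from collections import Counter

def mixed_distance(x, y, p, gamma):
    """Squared Euclidean on the first p attributes, weighted mismatches on the rest."""
    numeric = sum((a - b) ** 2 for a, b in zip(x[:p], y[:p]))
    categorical = sum(a != b for a, b in zip(x[p:], y[p:]))
    return numeric + gamma * categorical

def update_prototype(members, p):
    """Mean of each numerical column, mode of each categorical column."""
    cols = list(zip(*members))
    means = [sum(col) / len(col) for col in cols[:p]]
    modes = [Counter(col).most_common(1)[0][0] for col in cols[p:]]
    return tuple(means + modes)

def k_prototypes(data, k, p, gamma=1.0, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 1: pick k records as the initial prototypes.
    prototypes = rng.sample(data, k)
    labels = None
    for _ in range(max_iter):
        # Steps 2-3: allocate every record to its nearest prototype.
        new_labels = [
            min(range(k), key=lambda c: mixed_distance(row, prototypes[c], p, gamma))
            for row in data
        ]
        # Step 4: stop once no record changes its cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Update prototypes; an empty cluster keeps its old prototype.
        for c in range(k):
            members = [row for row, lab in zip(data, labels) if lab == c]
            if members:
                prototypes[c] = update_prototype(members, p)
    return labels, prototypes
```

For example, `k_prototypes(records, k=2, p=1)` treats the first attribute of each record as numerical and the rest as categorical.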
We also talked briefly about DBSCAN, a density-based clustering algorithm that groups the points in high-density regions of a data set into clusters and treats points in low-density regions as noise.