IKH

DBSCAN Clustering

This is a reading session: you are required to go through the text and understand the basic idea behind DBSCAN. You may also follow the links provided at the end to better understand the topic.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm that partitions a data set into clusters of high-density regions. It groups together points that are close to each other, based on a distance measure (usually Euclidean distance) and a minimum number of points. Points that lie in low-density regions are marked as outliers.

DBSCAN Parameters

The DBSCAN algorithm requires two parameters:

  • Epsilon or EPS.
  • MinPoints or MinSamples.

EPS

EPS is a distance parameter that defines the radius to search for nearby neighbours. We can imagine each data point having a circle with radius EPS drawn around it.
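The neighbour search described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library API: the function name eps_neighbours and the sample points are made up for this example, and Euclidean distance is assumed.

```python
import math

def eps_neighbours(points, index, eps):
    """Return the indices of all points within distance eps of points[index].

    The point itself is included, since its distance to itself is 0.
    """
    centre = points[index]
    return [j for j, p in enumerate(points) if math.dist(centre, p) <= eps]

# Hypothetical sample data: three nearby points and one far away.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.8), (3.0, 3.0)]

print(eps_neighbours(points, 0, eps=1.0))  # -> [0, 1, 2]
print(eps_neighbours(points, 3, eps=1.0))  # -> [3]
```

The first point has two other points inside its circle of radius 1.0, while the last point has only itself.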

The value of EPS has a significant impact on the clustering results. If EPS is too small, very few data points will fall within each other's neighbourhoods, and a large part of the data will not be clustered. These un-clustered points are treated as outliers because their neighbourhoods do not contain enough points to form a dense region. If EPS is too large, distinct clusters merge and most of the data ends up in a single cluster. EPS should be chosen based on the distances between points in the data set (a k-distance graph can be used to find a suitable value), but in general small EPS values are preferable.
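To make the k-distance idea concrete, the sketch below computes the values that such a graph would plot: the distance from each point to its k-th nearest neighbour, sorted in ascending order. The function name k_distances and the sample data are assumptions made for illustration, and choosing EPS near the "knee" of the curve is a common heuristic rather than a fixed rule.

```python
import math

def k_distances(points, k):
    """Distance from each point to its k-th nearest neighbour, sorted ascending."""
    result = []
    for i, p in enumerate(points):
        # Distances to every other point, nearest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)

# Hypothetical data: four corners of a unit square plus one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]

for d in k_distances(points, k=2):
    print(round(d, 2))
```

Here the four square corners all have a 2nd-nearest-neighbour distance of 1.0, while the distant point's value is above 13. A sharp jump like this in the sorted values suggests picking EPS just above the flat part of the curve.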

Min Samples

Min Samples (or Min Points) is the minimum number of points required to form a dense region. For example, if we set min_samples to 5, we need at least 5 points within a radius of EPS to form a dense region.

The minimum number of points can be derived from the number of dimensions (D) in the data set; as a general rule, Min Points >= D + 1.
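Putting the two parameters together, the following is a simplified sketch of how DBSCAN could be implemented in plain Python. It is meant for reading, not for production use: the names dbscan and NOISE are illustrative, Euclidean distance is assumed, the neighbour search is a brute-force scan, and a point's neighbourhood is assumed to include the point itself when it is compared against min_samples.

```python
import math

NOISE = -1  # label used here for outliers (an illustrative convention)

def dbscan(points, eps, min_samples):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    def neighbours(i):
        # Brute-force search; includes point i itself.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may be relabelled later if a cluster reaches it
            continue
        cluster += 1           # point i is dense enough: start a new cluster
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # reachable from a dense point: join cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = neighbours(j)
            if len(nb) >= min_samples:
                queue.extend(nb)     # j is dense too, so keep expanding
    return labels

# Hypothetical 2-D data (D = 2), so min_samples = D + 1 = 3.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (20, 20)]

print(dbscan(points, eps=1.5, min_samples=3))   # -> [0, 0, 0, 0, 1, 1, 1, -1]
print(dbscan(points, eps=0.5, min_samples=3))   # -> [-1, -1, -1, -1, -1, -1, -1, -1]
print(dbscan(points, eps=30.0, min_samples=3))  # -> [0, 0, 0, 0, 0, 0, 0, 0]
```

The three calls show the effect of EPS on the same data: a suitable value finds two clusters and one outlier, a value that is too small marks every point as an outlier, and a value that is too large merges everything into one cluster.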

The DBSCAN algorithm is used to find associations and structures in the data that are usually hard to find manually.

Use this link to visualise the DBSCAN algorithm in action.

Additional Resources

Application of DBSCAN at Netflix: Read Here.

Application of DBSCAN in Geolocated data: Read Here.

Original paper on DBSCAN, published at KDD by Martin Ester et al.: Read Here.