Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a clustering algorithm that can find clusters in noisy data. It works even on datasets where K-Means fails to find meaningful clusters.
The code snippets below show how to create clusters in data using DBSCAN.
Creating data for clustering
    # importing plotting library
    import matplotlib.pyplot as plt

    # Create sample data
    from sklearn.datasets import make_moons
    X, y = make_moons(n_samples=500, shuffle=True, noise=0.1, random_state=20)

    plt.scatter(x=X[:,0], y=X[:,1])
Sample Output:

Finding Best hyperparameters for DBSCAN using Silhouette Coefficient
The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is (b - a) / max(a, b). To clarify, b is the mean distance between a sample and the members of the nearest cluster that the sample is not a part of. Note that the Silhouette Coefficient is only defined if the number of labels satisfies 2 <= n_labels <= n_samples - 1.
The best value of the Silhouette Coefficient is 1 and the worst value is -1. Values near 0 indicate overlapping clusters. Negative values generally indicate that a sample has been assigned to the wrong cluster.
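To make the formula concrete, here is a minimal sketch that computes (b - a) / max(a, b) by hand and compares it with scikit-learn's silhouette_samples. The six one-dimensional points and the helper name manual_silhouette are made up for illustration, and the helper assumes every cluster has at least two members.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Tiny hand-made dataset (illustrative values): two well-separated groups on a line
points = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

def manual_silhouette(points, labels, i):
    """Hypothetical helper: Silhouette Coefficient (b - a) / max(a, b) for sample i."""
    distances = np.linalg.norm(points - points[i], axis=1)
    same_cluster = labels == labels[i]
    same_cluster[i] = False  # a sample is not compared with itself
    # a: mean distance to the other members of the sample's own cluster
    a = distances[same_cluster].mean()
    # b: smallest mean distance to the members of any other cluster
    b = min(distances[labels == other].mean()
            for other in np.unique(labels) if other != labels[i])
    return (b - a) / max(a, b)

manual_scores = np.array([manual_silhouette(points, labels, i)
                          for i in range(len(points))])
sklearn_scores = silhouette_samples(points, labels)

print(manual_scores)
print(sklearn_scores)
```

For the point at 1.0, a = (1 + 1) / 2 = 1 and b = (9 + 10 + 11) / 3 = 10, so its coefficient is (10 - 1) / 10 = 0.9. Averaging the per-sample values gives the single number that silhouette_score reports.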
    # Finding the best values of eps and min_samples
    import numpy as np
    import pandas as pd
    from sklearn.metrics import silhouette_score
    from sklearn.cluster import DBSCAN

    # Defining the list of hyperparameters to try
    eps_list = np.arange(start=0.1, stop=0.9, step=0.01)
    min_sample_list = np.arange(start=2, stop=5, step=1)

    # Collecting the silhouette score of each trial as a list of rows
    trial_results = []

    for eps_trial in eps_list:
        for min_sample_trial in min_sample_list:
            # Generating DBSCAN clusters (fit once and reuse the labels)
            db = DBSCAN(eps=eps_trial, min_samples=min_sample_trial)
            labels = db.fit_predict(X)

            # The silhouette score needs at least two distinct labels
            if len(np.unique(labels)) < 2:
                continue

            sil_score = silhouette_score(X, labels)
            # Rounding to 2 decimals because eps moves in steps of 0.01
            trial_parameters = "eps:" + str(round(eps_trial, 2)) + " min_sample :" + str(min_sample_trial)
            trial_results.append([sil_score, trial_parameters])

    # Building the data frame in one step (DataFrame.append no longer exists in pandas 2.x)
    silhouette_scores_data = pd.DataFrame(data=trial_results, columns=["score", "parameters"])

    # Finding the hyperparameters with the highest score
    silhouette_scores_data.sort_values(by='score', ascending=False).head(1)
Sample Output:

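One caveat with the search above: DBSCAN labels noise points as -1, and silhouette_score treats that label like any other cluster, so noise can distort the score. A possible variation, sketched below, scores only the points that were assigned to a cluster and reports the noise fraction separately. The helper name silhouette_without_noise is hypothetical, and the eps and min_samples values are the ones used later in this post, not a recommendation.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.metrics import silhouette_score

# Same sample data as in the snippets above
X, y = make_moons(n_samples=500, shuffle=True, noise=0.1, random_state=20)

def silhouette_without_noise(X, labels):
    """Hypothetical helper: score only the points DBSCAN assigned to a cluster."""
    clustered = labels != -1  # DBSCAN marks noise points with the label -1
    n_clusters = len(np.unique(labels[clustered]))
    # The score is undefined with fewer than 2 clusters, or when every
    # clustered point forms its own cluster
    if n_clusters < 2 or n_clusters >= clustered.sum():
        return None
    return silhouette_score(X[clustered], labels[clustered])

labels = DBSCAN(eps=0.18, min_samples=2).fit_predict(X)
noise_fraction = np.mean(labels == -1)
score = silhouette_without_noise(X, labels)

print("noise fraction:", noise_fraction)
print("silhouette score without noise:", score)
```

Ignoring noise can reward settings that discard many points as noise, so it is worth looking at the noise fraction alongside the score before picking a winner.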
Creating clusters using the best hyperparameters
    # DBSCAN Clustering
    from sklearn.cluster import DBSCAN
    db = DBSCAN(eps=0.18, min_samples=2)

    # Plotting the clusters
    plt.scatter(x=X[:,0], y=X[:,1], c=db.fit_predict(X))
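Since make_moons also returns the true moon membership in y, the opening claim about K-Means can be checked numerically. The sketch below is one possible check, not part of the original walkthrough: it compares both algorithms against y using the adjusted Rand index (1 means perfect agreement, values near 0 mean chance-level agreement). The K-Means settings (n_clusters=2, n_init=10, random_state=0) are assumptions chosen for illustration.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Same sample data as in the snippets above
X, y = make_moons(n_samples=500, shuffle=True, noise=0.1, random_state=20)

# DBSCAN with the hyperparameters used above
dbscan_labels = DBSCAN(eps=0.18, min_samples=2).fit_predict(X)

# K-Means needs the number of clusters up front (assumed to be 2 here)
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Agreement of each clustering with the true moon membership
dbscan_ari = adjusted_rand_score(y, dbscan_labels)
kmeans_ari = adjusted_rand_score(y, kmeans_labels)

print("DBSCAN adjusted Rand index:", dbscan_ari)
print("K-Means adjusted Rand index:", kmeans_ari)
```

How well DBSCAN recovers the two moons depends heavily on eps and min_samples, so a low index here is a hint to revisit the hyperparameter search rather than a verdict on the algorithm.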
