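Hierarchical clustering is a popular technique for grouping similar rows of data together. The bottom-up (agglomerative) variant used in this post starts with every row as its own cluster and repeatedly merges the two closest clusters until only the desired number remains.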
The code snippets below show how to perform hierarchical clustering in Python.
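Before the full walkthrough, a tiny sketch may help show what "merging the closest groups" looks like in practice. The five one-dimensional points below are hypothetical and chosen only for illustration, and single linkage is used here simply because its merge distances are easy to check by hand; it is not the method used in the rest of this post.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Five hypothetical 1-D points: one tight group near 1 and another near 10
points = np.array([[1.0], [1.2], [1.1], [10.0], [10.3]])

# Each row of Z records one merge:
# [index of cluster a, index of cluster b, merge distance, size of new cluster]
Z = linkage(points, method='single')

for step, (a, b, dist, size) in enumerate(Z, start=1):
    print(f"merge {step}: clusters {int(a)} and {int(b)} "
          f"at distance {dist:.2f} -> new cluster of size {int(size)}")
```

With n points there are always n - 1 merges, and the last one joins everything into a single cluster. The jump in merge distance at that final step is what a dendrogram makes visible.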
Creating sample data
    # Code to create data for hierarchical clustering in Python
    from sklearn.datasets import make_blobs
    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd

    # Create sample data for clustering: 100 points with 2 features around 2 centers
    SampleData = make_blobs(n_samples=100, n_features=2, centers=2,
                            cluster_std=1.5, random_state=40)

    # make_blobs returns a tuple: the data points and their true cluster labels
    X = SampleData[0]
    y = SampleData[1]

    # Create a DataFrame to represent the data with labels
    ClusterData = pd.DataFrame(list(zip(X[:, 0], X[:, 1], y)),
                               columns=['X1', 'X2', 'ClusterID'])
    print(ClusterData.head())

    # Create a scatter plot to visualize the data
    # (in a Jupyter notebook, run "%matplotlib inline" first to render plots inline)
    plt.scatter(ClusterData['X1'], ClusterData['X2'])
    plt.show()
Sample Output: [first five rows of ClusterData and a scatter plot of X1 against X2]
Creating dendrogram
    # Create a dendrogram to find the best number of clusters
    import scipy.cluster.hierarchy as sch
    dendrogram = sch.dendrogram(sch.linkage(X, method='ward'))
Sample Output: [dendrogram of the sample data]
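Reading the number of clusters off a dendrogram is usually done by eye: look for the tallest vertical stretch that no merge crosses and cut there. As a rough sketch, the same idea can be expressed numerically from the linkage matrix. This "largest gap" rule is only one common heuristic and not something the dendrogram plot itself computes, and the names `gaps` and `suggested_k` are illustrative.

```python
import numpy as np
import scipy.cluster.hierarchy as sch
from sklearn.datasets import make_blobs

# Same sample data as in the post
X, y = make_blobs(n_samples=100, n_features=2, centers=2,
                  cluster_std=1.5, random_state=40)

Z = sch.linkage(X, method='ward')

# Column 2 of Z holds the merge heights (non-decreasing for Ward linkage)
heights = Z[:, 2]
gaps = np.diff(heights)

# Cutting inside the gap that follows merge j leaves n - 1 - j clusters,
# because j + 1 merges have already happened at that point
n = X.shape[0]
best_gap = int(np.argmax(gaps))
suggested_k = n - 1 - best_gap
print("Suggested number of clusters:", suggested_k)

# Cut the tree into at most suggested_k flat clusters (labels start at 1)
labels = sch.fcluster(Z, t=suggested_k, criterion='maxclust')
print("Cluster sizes:", np.bincount(labels)[1:])
```

The heuristic can be misled by noisy or overlapping data, so it is best treated as a cross-check on the visual reading rather than a replacement for it.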
Creating final clusters based on the dendrogram
    # Creating 2 clusters based on a visual reading of the dendrogram above,
    # using bottom-up (agglomerative) hierarchical clustering
    from sklearn.cluster import AgglomerativeClustering

    # Ward linkage only supports Euclidean distance, which is the default.
    # The older "affinity" argument was deprecated in scikit-learn 1.2 and
    # removed in 1.4, so it is omitted here.
    hc = AgglomerativeClustering(n_clusters=2, linkage='ward')

    # Generate a cluster ID for each row using the agglomerative algorithm
    ClusterData['PredictedClusterID'] = hc.fit_predict(X)
    print(ClusterData.head())

    # Plot the predicted clusters
    plt.scatter(ClusterData['X1'], ClusterData['X2'], c=ClusterData['PredictedClusterID'])
    plt.show()

    # Choice of linkage
    # "ward" minimizes the variance of the clusters being merged.
    # "average" uses the average of the distances of each observation of the two sets.
    # "complete" or maximum linkage uses the maximum distances between all observations of the two sets.
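Sample Output: [first five rows of ClusterData with PredictedClusterID and a scatter plot colored by predicted cluster]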
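To get a feel for how the choice of linkage affects the result, the sketch below fits the same data with "ward", "average", and "complete" linkage and scores each against the true labels. This comparison is only possible because the sample data is synthetic and `make_blobs` returns the ground-truth labels; the use of the adjusted Rand index as the score is an assumption made for this illustration, not part of the original walkthrough.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Same sample data as in the post; y holds the true cluster labels
X, y = make_blobs(n_samples=100, n_features=2, centers=2,
                  cluster_std=1.5, random_state=40)

scores = {}
for method in ['ward', 'average', 'complete']:
    model = AgglomerativeClustering(n_clusters=2, linkage=method)
    predicted = model.fit_predict(X)
    # Adjusted Rand index: 1.0 means a perfect match with the true labels,
    # values near 0 mean agreement no better than chance
    scores[method] = adjusted_rand_score(y, predicted)

for method, score in scores.items():
    print(f"{method:>8}: ARI = {score:.3f}")
```

On real data there are no true labels to compare against, so the linkage is usually chosen from the shape of the dendrogram and knowledge of the domain.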
Author Details
Lead Data Scientist
Farukh is an innovator in solving industry problems using artificial intelligence. His expertise is backed by 10 years of industry experience. As a senior data scientist, he is responsible for designing AI/ML solutions that provide maximum gains for clients. As a thought leader, he focuses on solving the key business problems of the CPG industry. He has worked across different domains, including telecom, insurance, and logistics, and with global tech leaders including Infosys, IBM, and Persistent Systems. His passion for teaching inspired him to create this website!