How to create Hierarchical clustering in Python

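Hierarchical clustering is a popular technique for grouping similar rows of a dataset. The agglomerative (bottom-up) variant used in this post starts with every row as its own cluster and repeatedly merges the two closest clusters until only one remains.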
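As a minimal sketch of how the bottom-up grouping proceeds, the snippet below runs SciPy's linkage function on five made-up one-dimensional points and prints each merge in order. The points and variable names are hypothetical and chosen only for illustration; Ward linkage is used because it is the method applied later in this post.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Five hypothetical 1-D points: two tight pairs and one point far away
points = np.array([[1.0], [1.5], [5.0], [5.4], [10.0]])

# Each row of the linkage matrix describes one merge:
# [index of cluster a, index of cluster b, merge distance, rows in the new cluster]
merges = linkage(points, method='ward')

for step, (a, b, dist, size) in enumerate(merges, start=1):
    print(f"Step {step}: merge clusters {int(a)} and {int(b)} "
          f"at distance {dist:.2f} -> new cluster with {int(size)} rows")
```

Indices 0 to 4 refer to the original points; higher indices refer to clusters created by earlier merges. The closest points are merged first, and the last row always contains every point.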

The code snippets below show how to perform hierarchical clustering in Python.

Creating sample data

# Code to create data for Hierarchical Clustering in Python
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Create sample data: 100 rows, 2 features, 2 cluster centers
SampleData = make_blobs(n_samples=100, n_features=2, centers=2, cluster_std=1.5, random_state=40)

# make_blobs returns a tuple: the data points and their true cluster labels
X = SampleData[0]
y = SampleData[1]

# Creating a Data Frame to represent the data with labels
ClusterData = pd.DataFrame(list(zip(X[:,0], X[:,1], y)), columns=['X1','X2','ClusterID'])
print(ClusterData.head())

# Create a scatter plot to visualize the data
# (%matplotlib inline is a Jupyter magic; remove it when running as a plain script)
%matplotlib inline
plt.scatter(ClusterData['X1'], ClusterData['X2'])

Sample Output: the first five rows of ClusterData, followed by a scatter plot of X1 against X2.

Creating dendrogram

# create dendrogram to find best number of clusters
import scipy.cluster.hierarchy as sch
dendrogram = sch.dendrogram(sch.linkage(X, method='ward'))

Sample Output: a dendrogram of the 100 sample rows built with Ward linkage.
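Reading the number of clusters off a dendrogram by eye can also be approximated in code. The sketch below is one possible heuristic, not a definitive rule: it looks for the largest jump between consecutive merge distances in the linkage matrix and cuts the tree there with SciPy's fcluster. The variable names (Z, gaps, suggested_k) are introduced here for illustration, and the data is regenerated with the same make_blobs settings so the block runs on its own.

```python
import numpy as np
import scipy.cluster.hierarchy as sch
from sklearn.datasets import make_blobs

# Same sample data settings as in the post
X, y = make_blobs(n_samples=100, n_features=2, centers=2, cluster_std=1.5, random_state=40)

# Linkage matrix: one row per merge, merge distance in column 2
Z = sch.linkage(X, method='ward')
merge_distances = Z[:, 2]

# Heuristic (an assumption): a large jump between consecutive merge
# distances suggests that the next merge joins two dissimilar clusters
gaps = np.diff(merge_distances)
largest_gap = int(np.argmax(gaps))

# Cutting inside that gap means merges 0..largest_gap have happened,
# so len(X) - (largest_gap + 1) clusters remain
suggested_k = len(X) - (largest_gap + 1)
print("Suggested number of clusters:", suggested_k)

# Cut the tree so that at most suggested_k flat clusters are formed
labels = sch.fcluster(Z, t=suggested_k, criterion='maxclust')
print("Cluster sizes:", np.bincount(labels)[1:])
```

The heuristic can be misled by outliers or by clusters of very different density, so it is best treated as a cross-check on the visual reading of the dendrogram.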

Creating final clusters based on the dendrogram

# Creating 2 clusters based on a visual reading of the above dendrogram
# (agglomerative, i.e. bottom-up, hierarchical clustering)
from sklearn.cluster import AgglomerativeClustering

# Ward linkage requires Euclidean distance, which is the default, so it is not passed explicitly.
# The older affinity='euclidean' argument was deprecated in scikit-learn 1.2 in favor of metric and removed in 1.4.
hc = AgglomerativeClustering(n_clusters=2, linkage='ward')

# Generating cluster id for each row using agglomerative algorithm
ClusterData['PredictedClusterID'] = hc.fit_predict(X)
print(ClusterData.head())

# Plotting the predicted clusters
plt.scatter(ClusterData['X1'], ClusterData['X2'], c=ClusterData['PredictedClusterID'])

# Use of Linkage
# "ward" minimizes the variance of the clusters being merged.
# "average" uses the average of the distances of each observation of the two sets.
# "complete" or maximum linkage uses the maximum distances between all observations of the two sets.

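Sample Output: the first five rows of ClusterData with the new PredictedClusterID column, followed by a scatter plot colored by predicted cluster.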
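Because the sample data comes with known labels (ClusterID), the linkage options described in the comments above can be compared against them. The sketch below is one way to do that, assuming the adjusted Rand index from scikit-learn as the agreement measure; that metric is a choice made here, not something the original code uses. It is convenient because it ignores how the cluster ids are numbered, so a predicted id of 0 does not need to match a true id of 0.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Same sample data settings as in the post; y holds the true cluster labels
X, y = make_blobs(n_samples=100, n_features=2, centers=2, cluster_std=1.5, random_state=40)

scores = {}
for linkage_method in ['ward', 'average', 'complete']:
    # The distance metric is left at its default (Euclidean)
    hc = AgglomerativeClustering(n_clusters=2, linkage=linkage_method)
    predicted = hc.fit_predict(X)

    # 1.0 means perfect agreement with the true labels; values near 0 mean chance-level agreement
    scores[linkage_method] = adjusted_rand_score(y, predicted)
    print(f"{linkage_method}: adjusted Rand index = {scores[linkage_method]:.3f}")
```

On real data there are usually no true labels to compare against, so this kind of check is mainly useful for experimenting with synthetic data like the blobs used here.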

Author Details

Farukh Hashmi

Lead Data Scientist

Farukh is an innovator in solving industry problems using artificial intelligence. His expertise is backed by 10 years of industry experience. As a senior data scientist, he is responsible for designing AI/ML solutions that provide maximum gains for clients. As a thought leader, his focus is on solving the key business problems of the CPG industry. He has worked across different domains such as Telecom, Insurance, and Logistics, and with global tech leaders including Infosys, IBM, and Persistent Systems. His passion for teaching inspired him to create this website!

https://thinkingneuron.com/

thinkingneuron@gmail.com
