How to do Clustering using K-Means in Python

K-Means is one of the most popular algorithms for clustering data, that is, finding groups of similar rows in a dataset.
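To make the idea of "finding groups of similar rows" concrete, here is a minimal sketch of the loop that K-Means is generally described as running: assign every point to its nearest centroid, then move each centroid to the mean of its points, and repeat until nothing changes. The function name kmeans_sketch, its parameters, and the toy data are hypothetical and only for illustration. This is not how scikit-learn implements KMeans, which uses a smarter initialization and several restarts.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=100, seed=0):
    """Hypothetical helper: a bare-bones K-Means loop for illustration only."""
    rng = np.random.default_rng(seed)
    # Start from k distinct rows picked at random as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: label each point with the index of its nearest centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of the points assigned to it
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy data (an assumption for this sketch): two well-separated groups of 2-D points
toy_points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                       [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
toy_labels, toy_centroids = kmeans_sketch(toy_points, k=2)
print(toy_labels)
print(toy_centroids)
```

The cluster numbers themselves are arbitrary: which group ends up as 0 and which as 1 depends on the random starting centroids.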

The code snippets below show how to create clusters of data in Python.

Creating sample data for K-Means

# Sample code to create K-Means clusters in Python
# Creating the sample data for clustering
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
import pandas as pd

# Create 100 two-dimensional sample points spread around 2 centers
SampleData = make_blobs(n_samples=100, n_features=2, centers=2, cluster_std=1.5, random_state=40)

# make_blobs returns a tuple: the data points and their true cluster labels
X = SampleData[0]
y = SampleData[1]

# Creating a DataFrame to represent the data with labels
ClusterData = pd.DataFrame({'X1': X[:, 0], 'X2': X[:, 1], 'ClusterID': y})
print(ClusterData.head())

# Create a scatter plot to visualize the data
# (in a Jupyter notebook, run %matplotlib inline first to display plots inline)
plt.scatter(ClusterData['X1'], ClusterData['X2'])
plt.show()

Sample Output

Finding Best number of Clusters

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Creating an empty list to store the inertia value for each number of clusters
Inertia = []
# Running K-Means for 1 to 10 clusters to find the optimal number of clusters
for i in range(1, 11):
    km = KMeans(n_clusters=i,
                n_init=10,
                max_iter=300,
                random_state=0)
    km.fit(X)
    Inertia.append(km.inertia_)

# Plotting the curve to find the optimal number of clusters
# The "elbow", the point where the line starts to flatten, is the ideal value
plt.plot(range(1, 11), Inertia, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.tight_layout()
plt.show()

Sample Output

Finding the best number of clusters in K-Means
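The curve above plots the inertia_ attribute of each fitted model. As a rough illustration of what that number measures, the sketch below recomputes it by hand as the sum of squared distances from each point to the centroid of its assigned cluster. The data is regenerated with the same make_blobs settings so the block runs on its own; the variable names assigned_centroids and manual_inertia are made up for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Same sample data settings as in the article, regenerated so this block is self-contained
X, _ = make_blobs(n_samples=100, n_features=2, centers=2, cluster_std=1.5, random_state=40)

km = KMeans(n_clusters=2, n_init=10, max_iter=300, random_state=0).fit(X)

# Look up, for every point, the centroid of the cluster it was assigned to
assigned_centroids = km.cluster_centers_[km.labels_]

# Inertia: sum of squared distances between each point and its assigned centroid
manual_inertia = np.sum((X - assigned_centroids) ** 2)

print(manual_inertia, km.inertia_)
```

Because every extra cluster can only pull points closer to some centroid, inertia generally keeps falling as the number of clusters grows. That is why the plot is read for the point where the drop levels off rather than for its minimum.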

Creating the best number of Clusters

# Creating 2 clusters based on the above graph
km = KMeans(n_clusters=2, n_init=10, max_iter=300, random_state=0)
km.fit(X)

# Storing the cluster assigned to each row next to the original data
ClusterData['PredictedClusterID'] = km.predict(X)
print(ClusterData.head())

# Plotting the predicted clusters, colored by predicted cluster ID
plt.scatter(ClusterData['X1'], ClusterData['X2'], c=ClusterData['PredictedClusterID'])
plt.show()

Sample Output
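Since the sample data comes with the true ClusterID from make_blobs, one possible sanity check is to compare it against the predicted cluster IDs. The sketch below is an addition, not part of the original walkthrough. It assumes the same data settings as above and uses pandas crosstab plus scikit-learn's adjusted_rand_score. K-Means numbers its clusters arbitrarily, so predicted cluster 0 may correspond to true cluster 1, and a plain row-by-row equality check could be misleading.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Same sample data settings as in the article, regenerated so this block is self-contained
X, y = make_blobs(n_samples=100, n_features=2, centers=2, cluster_std=1.5, random_state=40)

km = KMeans(n_clusters=2, n_init=10, max_iter=300, random_state=0)
predicted = km.fit_predict(X)

# Cross-tabulate true labels against predicted labels to see how the IDs line up
agreement_table = pd.crosstab(pd.Series(y, name='ClusterID'),
                              pd.Series(predicted, name='PredictedClusterID'))
print(agreement_table)

# Adjusted Rand index: 1.0 means identical groupings, regardless of how clusters are numbered
ari = adjusted_rand_score(y, predicted)
print(ari)
```

In real clustering problems there are usually no true labels to compare against, so this kind of check mostly applies to synthetic data like the sample used here.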

Author Details

Farukh Hashmi

Lead Data Scientist

Farukh is an innovator in solving industry problems using artificial intelligence. His expertise is backed by 10 years of industry experience. As a senior data scientist, he is responsible for designing AI/ML solutions that provide maximum gains for clients. As a thought leader, his focus is on solving the key business problems of the CPG industry. He has worked across different domains like Telecom, Insurance, and Logistics. He has worked with global tech leaders including Infosys, IBM, and Persistent Systems. His passion for teaching inspired him to create this website!

https://thinkingneuron.com/

thinkingneuron@gmail.com
