K-Means is one of the most popular clustering algorithms. It partitions the rows of a dataset into K groups so that rows in the same group are similar to each other.
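To make the idea concrete, here is a minimal NumPy sketch of the standard K-Means loop (often called Lloyd's algorithm): assign each row to its nearest centroid, move each centroid to the mean of its rows, and repeat until nothing changes. The function name kmeans_sketch, the random initialisation, and the toy points are assumptions made for illustration. scikit-learn's KMeans, used in the rest of this post, is more careful about initialisation and convergence.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=100, seed=0):
    """Toy Lloyd-style K-Means (illustrative only): returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct rows chosen at random as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each row joins its nearest centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each centroid moves to the mean of its rows
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated groups of rows (hypothetical toy data)
demo_points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                        [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
demo_centroids, demo_labels = kmeans_sketch(demo_points, k=2)
print(demo_labels)
print(demo_centroids)
```

The first three rows should end up sharing one label and the last three the other. Which group gets label 0 and which gets label 1 depends on the random starting centroids.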
The code snippets below show how to create clusters of data in Python.
Creating sample data for K-Means
    # Sample code to create K-Means clusters in Python
    # Creating the sample data for clustering
    from sklearn.datasets import make_blobs
    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd

    # Create sample data for clustering
    SampleData = make_blobs(n_samples=100, n_features=2, centers=2, cluster_std=1.5, random_state=40)

    # make_blobs returns a tuple: the data points and their true cluster labels
    X = SampleData[0]
    y = SampleData[1]

    # Creating a DataFrame to represent the data with labels
    ClusterData = pd.DataFrame(list(zip(X[:, 0], X[:, 1], y)), columns=['X1', 'X2', 'ClusterID'])
    print(ClusterData.head())

    # Create a scatter plot to visualize the data
    # (in a Jupyter notebook, run "%matplotlib inline" first to show plots inline)
    plt.scatter(ClusterData['X1'], ClusterData['X2'])
    plt.show()
Sample Output: the first five rows of ClusterData and a scatter plot of X1 against X2.

Finding the best number of clusters
    from sklearn.cluster import KMeans
    import matplotlib.pyplot as plt

    # Creating an empty list to store the inertia values
    Inertia = []

    # Fitting K-Means for 1 to 10 clusters to find the optimal number of clusters
    for i in range(1, 11):
        km = KMeans(n_clusters=i, n_init=10, max_iter=300, random_state=0)
        km.fit(X)
        Inertia.append(km.inertia_)

    # Plotting the curve to find the optimal number of clusters
    # The point where the line starts to flatten out (the "elbow") is the ideal value
    plt.plot(range(1, 11), Inertia, marker='o')
    plt.xlabel('Number of clusters')
    plt.ylabel('Inertia')
    plt.tight_layout()
    plt.show()
Sample Output: a line plot of inertia against the number of clusters (1 to 10).
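Reading the elbow off an inertia plot can be subjective, so a second opinion may help. The sketch below uses the silhouette score, which is a different heuristic from the inertia curve above and not part of the original snippet. It regenerates the same sample data so it can run on its own. The names silhouette_by_k and best_k are illustrative assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Same sample data as in the first snippet
X, y = make_blobs(n_samples=100, n_features=2, centers=2, cluster_std=1.5, random_state=40)

# The silhouette score needs at least 2 clusters, so start from 2
silhouette_by_k = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, max_iter=300, random_state=0).fit_predict(X)
    silhouette_by_k[k] = silhouette_score(X, labels)

# Higher is better; the score ranges from -1 to 1
best_k = max(silhouette_by_k, key=silhouette_by_k.get)
for k, score in silhouette_by_k.items():
    print(k, round(score, 3))
print('Best k by silhouette score:', best_k)
```

If the silhouette score and the elbow point to the same number of clusters, that is reasonable evidence for the choice. If they disagree, it is worth plotting the clusters for both candidates.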

Creating the clusters with the best value of K
    # Creating 2 clusters based on the above graph
    km = KMeans(n_clusters=2, n_init=10, max_iter=300, random_state=0)
    km.fit(X)
    ClusterData['PredictedClusterID'] = km.predict(X)
    print(ClusterData.head())

    # Plotting the predicted clusters
    plt.scatter(ClusterData['X1'], ClusterData['X2'], c=ClusterData['PredictedClusterID'])
    plt.show()
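Sample Output: the first five rows of ClusterData with the new PredictedClusterID column, and a scatter plot coloured by predicted cluster.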
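Because the sample data comes with true labels, the predicted clusters can be checked against them. K-Means numbers its clusters arbitrarily, so predicted cluster 0 may correspond to true cluster 1. Comparing the two columns row by row can therefore be misleading. The sketch below is an addition to the original snippet. It rebuilds the same data and model, then uses a cross-tabulation and the adjusted Rand index, both of which ignore how the clusters are numbered. The variable names predicted, comparison and ari are illustrative.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Same sample data and model settings as in the snippets above
X, y = make_blobs(n_samples=100, n_features=2, centers=2, cluster_std=1.5, random_state=40)
km = KMeans(n_clusters=2, n_init=10, max_iter=300, random_state=0)
predicted = km.fit_predict(X)

# Rows: true cluster, columns: predicted cluster
comparison = pd.crosstab(pd.Series(y, name='ClusterID'),
                         pd.Series(predicted, name='PredictedClusterID'))
print(comparison)

# 1.0 means the two groupings are identical; values near 0 mean no better than chance
ari = adjusted_rand_score(y, predicted)
print('Adjusted Rand index:', round(ari, 3))

# Coordinates of the fitted cluster centres (one row per cluster)
print(km.cluster_centers_)
```

In the cross-tabulation, a good clustering shows each true cluster concentrated in a single predicted column, whichever column that happens to be.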

Author Details
Lead Data Scientist
Farukh is an innovator in solving industry problems using artificial intelligence. His expertise is backed by 10 years of industry experience. As a senior data scientist, he is responsible for designing AI/ML solutions that provide maximum gains for clients. As a thought leader, his focus is on solving the key business problems of the CPG industry. He has worked across different domains such as Telecom, Insurance, and Logistics. He has worked with global tech leaders including Infosys, IBM, and Persistent Systems. His passion for teaching inspired him to create this website!