Data Science Interview Questions for IT Industry Part-4: Unsupervised ML

Unsupervised ML

In the previous post, I talked about the popular supervised machine learning algorithms. In this post, I will discuss some of the important unsupervised machine learning algorithms that are commonly asked about in data science interviews.

I will also show you how to run these algorithms in R/Python.


Overview

Unsupervised means without any supervision. These algorithms do not need labeled data to learn about the data. It means you don't have to explicitly 'train' them by telling them this is the input (X) and that is the output (y).

Unsupervised machine learning is about finding the patterns in data by just passing full data to the algorithm.


Clustering, Dimension Reduction, and Association are the three major types of tasks that are performed by Unsupervised Machine Learning algorithms.

Types of Unsupervised Machine Learning algorithms

Clustering: Finding groups in data. Basically, putting similar ROWS of the data under one group. For example,

  • Finding the types of customers using shopping behavior data
  • Finding the types of users in telecom, based on their usage data.
  • Finding how many groups are present in a given unknown dataset.

Important algorithms used for clustering are K-Means, Hierarchical clustering, DBSCAN, and OPTICS.

Dimension Reduction: Finding and combining similar columns in data. Basically, reducing the number of columns (dimensions) to a smaller number of columns.

Important algorithms used for Dimension Reduction are Factor Analysis, PCA, ICA, t-SNE, and UMAP.

Association: Finding patterns in customer purchase behavior and generating purchase recommendations. Basically, if a user purchases item-A, which item are they likely to purchase along with it?

Important algorithms used for Association are Apriori, Eclat, and FP-Growth.

Let’s discuss these algorithms one by one!


Q How does the K-Means clustering algorithm work?

K-Means clustering summary using R and Python

K-Means Algorithm is a simple yet powerful Unsupervised machine learning algorithm.
It is used to find groups/clusters of similar rows in data.
The working of K-Means is simple. Pass the data, tell how many clusters you want and K-Means will assign a group to every row in data.

Please look at the K-Means algorithm steps below

  1. If the value of K=2, then we need to find 2 groups in data
  2. Assign group-1 or group-2 label to each observation in the data randomly
  3. Find the “center” of all group-1 points and all group-2 points. These centers are known as “centroids”
  4. Find the distances of each point from both the “centroids”
  5. Assign each point to the nearest centroid. Hence, the corresponding group.
  6. Repeat steps 3-5 till there are no further changes in the positions of the centroids OR the maximum number of iterations is reached.
  7. In the end, each observation in data is assigned to one of the groups.

K-Means Clustering

data –> K-Means(find 2 groups) –> Found 2 groups

On the same data, if you ask K-Means to find 3 groups, the algorithm will find you 3 groups as well!

data –> K-Means(find 3 groups) –> Found 3 groups

How do we find the optimal number of groups in data?
By running the process iteratively for K=1, 2, 3, 4…10 and noting down the "withinness", i.e. the similarity between all points within a group. This is measured as the sum of the distances of each point from its centroid.

If all the points are close to the centroid, that means the group is homogeneous. i.e. all observations are very similar.

Ideally, withinness should be zero, but in practice we try to keep it as low as possible, since you never get zero.

“Euclidean” distance is popular for withinness calculation.

Q. How to create K-Means in R?

Q. How to create K-Means in Python?
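The original code listing is not reproduced here; below is a minimal sketch using scikit-learn's KMeans (the synthetic blobs data, the choice of K, and the random_state are illustrative assumptions):

```python
# Minimal K-Means sketch with scikit-learn; data and parameters are illustrative
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 natural groups
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Try K = 1..10 and note the "withinness" (within-cluster sum of squares, inertia_)
withinness = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    withinness.append(km.inertia_)
print(withinness)  # look for the "elbow" where the drop flattens out

# Fit the final model with the chosen K; every row gets a group label
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(kmeans.labels_[:10])      # groups assigned to the first 10 rows
print(kmeans.cluster_centers_)  # the centroids
```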


Q How does the Hierarchical clustering algorithm work?

Hierarchical clustering summary using R and Python

Hierarchical clustering is another algorithm that helps to group similar rows together.

Hierarchical clustering algorithm has two major types listed below

  1. Agglomerative (Bottom-Up Approach)
  2. Divisive (Top-Down Approach)

Agglomerative Clustering is a Bottom-Up approach; it works with the below steps.

  1. Assume every row is one cluster in itself, so if your data has 100 rows then start at the bottom with 100 clusters
  2. Compute the distance matrix of every cluster from every other cluster using a distance measure, e.g. Euclidean or Manhattan. For text data and other non-numeric data, we use Hamming or Levenshtein distance.
  3. Use the above distance matrix to calculate the distances between the clusters. The cluster distances can be of 3 types:
    • Single Linkage (The shortest distance between two clusters)
    • Complete Linkage (The largest distance between two clusters)
    • Average Linkage (The average distance between two clusters)
  4. Find which clusters are closest to each other
  5. Combine all those clusters that are closest to each other.
  6. Keep repeating steps 2-5 until there is only 1 cluster left.

Types of Linkages in Hierarchical Clustering

The above iterative process creates a chart known as the Dendrogram.

This diagram helps us to select the optimal number of clusters in the data.

The dendrogram in Hierarchical clustering

If I “cut” the dendrogram at Distance=20,000, then I will get 2 clusters

If I “cut” the dendrogram at Distance=15,000, then I will get 4 clusters.

If I “cut” the dendrogram at Distance=10,000, then I will get 5 clusters.

So which one is the best? Where exactly to cut the Dendrogram?

We cut the dendrogram at the point where the horizontal line covers the maximum distance between 2 vertical lines.

Hence, we make sure the clusters are as far apart as possible. In the below scenario, we will cut the dendrogram at Distance=10,000. It will fetch us 5 clusters from the data.

Where to cut the dendrogram?

Divisive Clustering (Top-Down approach) is exactly the opposite of Agglomerative. Start from the top with all rows in one cluster and keep dividing the clusters until every row is a cluster in itself.

Q. How to create Hierarchical clustering in R?

Q. How to create Hierarchical clustering in Python?
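The code itself is not shown here; a minimal sketch with SciPy's hierarchical clustering functions could look like the following (the synthetic data, the ward linkage, and the cut at 3 clusters are assumptions for illustration):

```python
# Minimal agglomerative clustering sketch with SciPy; data and parameters are illustrative
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=42)

# Build the merge tree bottom-up; method can be "single", "complete", "average", or "ward"
Z = linkage(X, method="ward")

# Plot the dendrogram, then "cut" it so that a chosen number of clusters remains
dendrogram(Z)
plt.show()
labels = fcluster(Z, t=3, criterion="maxclust")  # cut to get 3 clusters
print(labels)
```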


Q How does the DBSCAN clustering algorithm work?

DBSCAN clustering summary using R and Python

DBSCAN stands for Density Based Spatial Clustering of Applications with Noise.

Do not get intimidated by this algorithm's name; the more complex it sounds, the simpler it actually is! 🙂

Let's quickly understand the relevant terminology for DBSCAN.

  1. ε, epsilon (eps): is known as the maximum allowed distance from one point to the other point, for both of them to be considered in one group/cluster
  2. MinPts: is the minimum number of points which should be present close to each other, at a distance of epsilon (ε), so that they can all form a group/cluster
  3. Core Point: That point which has at least MinPts number of points near to it, within the distance of ε(eps)
  4. Border Point/Non-Core Point: That point in data which has less than the minimum number of points(MinPts) within its reach (a distance of eps)
  5. Noise: That point which has no point near to it within a distance of eps

In the below diagram, the minimum distance required (eps) is 2 units and the minimum number of points which must be near to the core point is 3.

DBSCAN Algorithm works based on eps and MinPts

You can observe in the above diagram:

  • The core point has 4 points within its circle of 2 units.
  • The Border point has only one point (the blue point) near it and hence it is not a core point.
  • The Noise point (red point) has no points within its circle, hence it is labeled as noise; it does not belong to any group/cluster.

The DBSCAN algorithm is very simple. Look at its working steps below:

  1. Start with one point randomly and find out if it is a core point by checking whether it has at least MinPts points within a distance of eps
  2. If it is a core point, make it a cluster and move to the next unvisited point to repeat step-1
  3. If the number of points within eps distance is fewer than MinPts then mark it as non-core point
  4. If the number of points within eps distance is Zero, then mark that point as Noise.
  5. Combine all those clusters together whose points are within eps distance. Also known as density connection or connected components. This starts a chain reaction. If cluster-1 and cluster-2 are connected and cluster-2 and cluster-3 are connected then cluster-1 and cluster-3 are also connected. Hence all these are combined to make one single cluster.

DBSCAN: All those groups which are density connected are put in one single cluster

Q. How to create DBSCAN clustering in R?

Q. How to create DBSCAN clustering in Python?
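Again, just a minimal sketch with scikit-learn's DBSCAN; the half-moons dataset and the eps/min_samples values are illustrative assumptions:

```python
# Minimal DBSCAN sketch with scikit-learn; data and parameters are illustrative
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: a non-spherical shape that K-Means struggles with
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# eps         -> maximum distance between two points to count as neighbours
# min_samples -> MinPts, minimum neighbours for a point to be a core point
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

print(db.labels_[:20])  # cluster label per row; -1 marks noise points
print(set(db.labels_))  # the clusters found (plus -1 if any noise)
```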


Q How does the OPTICS clustering algorithm work?

OPTICS clustering summary using R and Python

OPTICS stands for Ordering Points To Identify Clustering Structure.

Once again another fancy name but a very simple algorithm!

This algorithm can be seen as a generalization of DBSCAN. A major issue with DBSCAN is that it fails to find clusters of varying density due to fixed eps.

This is solved in OPTICS by finding the reachability of each point from the core points and then deciding the clusters based on the reachability plot.

Let's quickly understand the relevant terminology for OPTICS.

  1. ε, epsilon (eps): is known as the Maximum allowed distance from one point to the other point for both of them to be considered in one group/cluster
  2. MinPts: is the minimum number of points which should be present close to each other, at a distance of epsilon (ε), so that they can all form a group/cluster
  3. Core Point: That point which has at least MinPts number of points near to it, within the distance of ε(eps)
  4. Border Point/Non-Core Point: That point in data which has less than the minimum number of points(MinPts) within its reach (a distance of eps)
  5. Noise: That point which has no point near to it within a distance of eps

In addition to the above, there are two additional terminologies which are used in OPTICS:

  • Core Distance: The minimum distance required by a point to become a core point. It means it is possible to find the MinPts number of points within this distance. Core distance can be less than the pre-decided value of ε, epsilon (eps), which is the maximum allowed distance to find MinPts.
  • Reachability distance: This is the distance to reach a point from the cluster. Now if the point lies within the Core Distance, then Reachability Distance=Core Distance. And, if the point lies outside the Core Distance, then Reachability Distance is the minimum distance from the extreme point of the cluster.

Core Distance and Reachability Distance in OPTICS

The OPTICS algorithm works with below steps:

  1. For the given values of MinPts and eps (ε), find out whether a point has at least MinPts points within a distance less than or equal to eps. If it does, tag it as a Core Point. Update the reachability distance = core distance for all the points within the cluster.
  2. If it is not a core point then find out its density connected distance from the nearest cluster. Update the reachability distance.
  3. Arrange the data in increasing order of reachability distance for each cluster. The smallest distances come first and represent the dense sections of data and the largest distances come next representing the noise section. This is a special type of dendrogram
  4. Find out the places where a sharp decline is happening in the reachability distance plot.
  5. “Cut” the plot in the y-axis by a suitable distance to get the clusters.

Basically, those points which are close to each other and form a dense cluster will have small reachability distances, and those points which are far away from all the clusters are Noise points; they don't belong to any of the clusters.

In the below plot, you can see the reachability distances of all the points ordered. A sudden decline marks the starting of a new cluster.

Clusters are created by inspecting this graph and choosing an appropriate reachability distance threshold to define the clusters. In the below plot, I have used Reachability Distance=0.5 as the threshold. This can also be done automatically by providing a measure of sharp decline known as "xi". Its value varies between 0 and 1; the higher the value, the sharper the decline required in the plot.

Reachability distances are used to define the clusters in OPTICS

Cluster Plot for OPTICS

Q. How to create OPTICS clustering in R?

Q. How to create OPTICS clustering in Python?
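A minimal sketch with scikit-learn's OPTICS; the varying-density data and the min_samples/xi values are illustrative assumptions:

```python
# Minimal OPTICS sketch with scikit-learn; data and parameters are illustrative
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

# Blobs with different spreads, i.e. clusters of varying density
X, _ = make_blobs(n_samples=300, centers=3,
                  cluster_std=[0.3, 1.0, 2.0], random_state=42)

# min_samples is MinPts; xi controls how steep a drop in the reachability
# plot must be before a new cluster starts
opt = OPTICS(min_samples=10, xi=0.05, min_cluster_size=0.05).fit(X)

print(opt.labels_[:20])                  # cluster per row, -1 = noise
print(opt.reachability_[opt.ordering_])  # reachability distances in cluster order
```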


Q How does the Factor Analysis algorithm work?

Factor Analysis summary using R and Python

Factor analysis is a dimension reduction technique. The goal is to find out latent factors or hidden groups that are responsible for the observed data in the variables.

For example, if an employee is unhappy in an organization, there could be many reasons for it, but, if looked at carefully, these reasons can be grouped together within some smaller set of MAJOR reasons! Factor analysis helps us find these major groups of variables in the data, which are also known as Factors.

Factor analysis helps us find hidden groups(Factors) of variables in the data automatically

Look at the diagram below; it represents data from an employee survey. It shows the average ratings for each department: how employees feel about the way their employer handles complaints, how they feel about the privileges provided, how they feel about learning on the job, and how they feel about their salary raises and career advancement, etc.

Factor Analysis helps to find hidden groups in the data

You can see that some of these responses represent a common subject or the Major reasons for the way they responded, e.g. If they are satisfied with their organization’s work culture, then scores will be high for overall rating, complaints, and privileges. If they feel they are learning on the job then scores will be high for learning opportunities, similarly for Promotions.

So there are 3 major reasons, Work Culture, Learning Opportunities, and Promotions, which actually DRIVE the ratings of smaller reasons like rating, complaints, etc.

Statistically, you can imagine the equations of the above diagram as below:

rating= β1*(Work Culture) + C1
complaints=β2*(Work Culture) + C2
privileges=β3*(Work Culture) + C3
Learning= β4*(Learning Opportunities) + C4
raises= β5*(Promotions) + C5

Factor Analysis helps to find these coefficients β1, β2, β3… and C1, C2, C3… etc.

These coefficients determine what the weightage of rating/complaints/privileges is in the Work Culture Factor, what the weightage of raises/critical/advance is in the Promotions Factor, etc.

These weightages are also known as Factor Loadings. Basically, what part of a given variable contributes to the group/Factor it belongs to.

Factor Loadings are the coefficient values for each variable, which represents its weightage in the Factor it belongs to.

Factor Analysis Vs PCA

Factor Analysis is often confused with Principal Component Analysis PCA!

Both are dimension reduction techniques, but, the main difference between Factor Analysis and PCA is the way they try to reduce the dimensions.

Factor Analysis tries to find the hidden groups of variables, which act like a common driving factor for the values in all those variables. For example, in a customer survey, if you are unhappy with the taste of the coffee, then all the questions about coffee taste in the survey will have low scores! Hence Bad Taste will be a Factor, and questions like Coffee sweetness rating, Coffee bitterness rating, and Coffee Freshness rating will represent the individual variables that will have low ratings.

Factor Analysis will form equations like below:

Coffee sweetness rating = β1 * (Bad Taste) +C1
Coffee bitterness rating = β2 * (Bad Taste) +C2
Coffee Freshness rating = β3 * (Bad Taste) +C3

PCA tries to find fewer new variables called Principal Components, which are linear combinations of the variables in the data; hence these fewer variables "represent" all the variables while reducing the dimensions. Here the goal is to create a new dataset that has the same number of rows but a smaller number of columns (Principal Components) which explain most of the variance of all the original variables in the data.

PCA will form equations like below:

PC1=α1 * ( Coffee sweetness rating) + α2(Coffee bitterness rating) + α3(Coffee Freshness rating)
PC2=β1 * ( Coffee sweetness rating) + β2(Coffee bitterness rating) + β3(Coffee Freshness rating)

Q. How to do Factor Analysis in R?

Q. How to do Factor Analysis in Python?
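A minimal sketch with scikit-learn's FactorAnalysis on simulated survey-style data; the simulated ratings, column names, and the choice of 2 factors are all assumptions for illustration:

```python
# Minimal factor analysis sketch with scikit-learn; the survey data is simulated
import numpy as np
import pandas as pd
from sklearn.decomposition import FactorAnalysis

# Simulated employee-survey ratings (rows = employees, columns = questions)
rng = np.random.default_rng(42)
culture = rng.normal(size=200)    # hidden "Work Culture" factor
learning = rng.normal(size=200)   # hidden "Learning Opportunities" factor
df = pd.DataFrame({
    "rating":     3 + 0.9 * culture  + rng.normal(scale=0.3, size=200),
    "complaints": 3 - 0.8 * culture  + rng.normal(scale=0.3, size=200),
    "privileges": 3 + 0.7 * culture  + rng.normal(scale=0.3, size=200),
    "learning":   3 + 0.9 * learning + rng.normal(scale=0.3, size=200),
})

# Ask for 2 hidden factors; components_ holds the factor loadings (the β's above)
fa = FactorAnalysis(n_components=2, random_state=42).fit(df)
loadings = pd.DataFrame(fa.components_.T, index=df.columns,
                        columns=["Factor1", "Factor2"])
print(loadings)
```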


Q How does the PCA algorithm work?

Principal Component Analysis summary using R and Python

PCA stands for Principal Components Analysis.

It is one of the most popular algorithms which is used for dimension reduction (Reducing the number of columns/variables in data).

PCA represents the full data with fewer columns called Principal Components

PCA represents the whole data with the help of fewer variables called principal components. This smaller number of new columns accounts for most of the variance in the original data!

PCA works with below steps

  1. Find linear combinations of variables in data like
    • PC1=α1 * (rating) + α2(complaints) + α3(privileges). . .
    • PC2=β1 * (rating) + β2(complaints) + β3(privileges). . .
  2. While satisfying the independence criterion for all principal components: if you pick any two principal components and find the correlation between them, it should be zero.
  3. The first principal component explains the maximum amount of variance in data, the second principal component explains slightly less amount of variance, so on and so forth…

Q. How to run PCA in R?

Q. How to run PCA in Python?
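A minimal sketch with scikit-learn's PCA; the Iris dataset and keeping 2 components are illustrative choices:

```python
# Minimal PCA sketch with scikit-learn; dataset and n_components are illustrative
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data                          # 4 numeric columns
X_scaled = StandardScaler().fit_transform(X)  # scaling matters for PCA

pca = PCA(n_components=2)                     # keep 2 principal components
components = pca.fit_transform(X_scaled)      # same rows, fewer columns

print(components[:5])                 # PC1 and PC2 for the first 5 rows
print(pca.explained_variance_ratio_)  # share of variance each PC explains
```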


Q How does the ICA algorithm work?

Independent Component Analysis summary

Independent Component Analysis(ICA) algorithm is similar to PCA. There is a very subtle difference!

In PCA we try to find linear combinations of variables as Principal Components, and there is NO correlation between the principal components.

In ICA we try to find linear combinations of variables as Independent Components, and there is no dependency between any two components.

What is the difference between correlation and independence?

Correlation signifies the mathematical relation between two vectors, like sales increase with an increase in discounts, hence there is a high positive correlation.

But the actual driver of sales would be the quality score of the product. Hence, we can say Sales are “dependent” on the quality score of the product.

Similarly, independence is defined: the sales of shampoo are independent of the weather, but the sales of ice cream are dependent on the weather.

Mathematically ICA works with below steps:

Establish an equation X=M*D where 

  • X= matrix of Independent Components
  • M=Mixing Matrix
  • D=Original Data

ICA algorithm finds independent components by multiplying a mixing matrix to original data

The goal of ICA is to find a mixing matrix "M" that, when multiplied by the given data "D", produces an "X" whose columns are all independent of each other.

Q. How to run ICA in R?

Q. How to run ICA in Python?
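A minimal sketch with scikit-learn's FastICA; the two simulated source signals and the mixing matrix are assumptions used only to demonstrate the idea:

```python
# Minimal ICA sketch with scikit-learn's FastICA; the mixed signals are simulated
import numpy as np
from sklearn.decomposition import FastICA

# Two independent source signals, then mix them together
t = np.linspace(0, 8, 2000)
sources = np.c_[np.sin(2 * t), np.sign(np.cos(3 * t))]  # sine wave + square wave
mixing = np.array([[1.0, 0.5], [0.5, 2.0]])
mixed = sources @ mixing.T                               # what we actually observe

# FastICA recovers columns that are statistically independent
ica = FastICA(n_components=2, random_state=42)
recovered = ica.fit_transform(mixed)   # the independent components
print(recovered[:5])
print(ica.mixing_)                     # the estimated mixing matrix
```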


Q How does the T-SNE algorithm work?

T-SNE algorithm summary using R and Python

T-SNE stands for “t-distributed Stochastic Neighbor Embedding”.

This is another dimensionality reduction technique primarily aimed at visualizing data.

Plotting high-dimensional data (e.g. 100 columns) is computationally intensive as well as difficult to interpret.

If the same data can be represented in 2 or 3 dimensions, it can be plotted easily and the data patterns can be understood.

This can also be done using PCA, but it is a linear method. If the relationship between the features is non-linear, then PCA will fail to provide good components that are efficient in explaining the variance. This is where t-SNE comes into the picture.

t-SNE can find the non-linear relations between features and represent them in a lower dimension.

t-SNE works by finding a probability distribution in low dimensional space “similar” to the distribution of original high dimensional data.


The working steps of t-SNE can be summarized as below

  1. Find the euclidean distance of every point from every other point in the data in high dimension
  2. Calculate the probability of two points being close to each other in a high dimension. If the distance is low then the probability is high.
  3. Find the euclidean distance of every point from every other point in the data in low dimension
  4. Calculate the probability of two points being close to each other in low dimension. If the distance is low then the probability is high.
  5. Minimize the difference in probabilities of high dimension and low dimension. 

The "t" in t-SNE is the t-distribution. The reason for using a t-distribution instead of the normal distribution is the long tails of the t-distribution, which allow long distances to be represented in a better way.

t-distribution has long tails as compared to the normal distribution

SNE stands for Stochastic Neighbor Embedding; this concept tries to keep the clusters of data points in the higher dimension intact when these data points are encoded in the lower dimension.

Basically, if data rows 1, 10, and 13 are "similar" and close to each other (small Euclidean distance) in the high dimension, then they will be "similar" and close to each other in the low dimension as well, hence retaining the non-linear properties of the data.

Q. How to run T-SNE in R?

Q. How to run T-SNE in Python?
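A minimal sketch with scikit-learn's TSNE; the digits dataset and the perplexity value are illustrative assumptions:

```python
# Minimal t-SNE sketch with scikit-learn; dataset and perplexity are illustrative
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X = load_digits().data   # 64 columns (8x8 pixel images)

# Project to 2 dimensions; perplexity roughly controls the neighbourhood size
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
embedding = tsne.fit_transform(X)

print(embedding.shape)   # (1797, 2): same rows, only 2 columns
```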


Q How does the UMAP algorithm work?

UMAP summary using R and Python

Uniform Manifold Approximation and Projection (UMAP) is an improvement over the t-SNE algorithm. The basic concept is the same: projecting higher-dimensional data into lower dimensions.

Two noteworthy points regarding UMAP are

  • UMAP is faster than t-SNE
  • The algorithm is less computationally intense, hence, it can handle large datasets

UMAP uses an approach like the K-Nearest Neighbor to preserve the local and global patterns in data.

In simple terms, the points which are close in the higher dimension will be close in the lower dimension as well and vice versa.

To perform this activity, the UMAP algorithm requires two inputs

  1. n_neighbors: how many nearby points to look at for each point.
  2. min_dist: how tightly the points are allowed to be packed together in the lower-dimensional embedding (the minimum distance between embedded points).

Using the above parameters, the UMAP algorithm finds out a relative position for each point in the lower dimension while keeping the relative distances of all points the same, hence, representing the high dimensional pattern.

Q. How to run UMAP in R?

Q. How to run UMAP in Python?
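A minimal sketch with the umap-learn package (installed separately, e.g. pip install umap-learn); the dataset and parameter values are illustrative assumptions:

```python
# Minimal UMAP sketch with umap-learn; dataset and parameters are illustrative
import umap
from sklearn.datasets import load_digits

X = load_digits().data

# n_neighbors -> how many nearby points to look at
# min_dist    -> how tightly points may be packed in the low-dimensional space
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, random_state=42)
embedding = reducer.fit_transform(X)

print(embedding.shape)   # (1797, 2)
```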


Q How does Association Rule Mining Work?

Association Rule Mining is a concept which tries to find patterns(rules) in the purchasing behavior of shoppers.

For example, if a shopper purchases a mobile phone, he is likely to purchase a mobile phone cover, so recommend them some!

This will increase the chances of selling the mobile phone cover, since it is most needed at the time of the phone purchase.

One can take many critical decisions based on these rules. Like storing products nearby in shelves of a store to increase the chances of them getting purchased together. Recommending some products to online shoppers based on historical transactions, etc.


It would be too complex a task for a human to find out all the rules, hence we use algorithms!

There are three major algorithms listed below that can find these rules for us.

  1. Apriori
  2. ECLAT
  3. FP-Growth

All these algorithms are based on simple counting.

Counting the number of times a product is purchased and calculating a few metrics known as Support, Confidence, and Lift.

Look at the below table of sample transactions to understand these metrics

Sample Transactions in a grocery store

1. Support: The fraction of all transactions in which a product, or multiple products together, is purchased.

Example: The support of milk=(Total number of times milk is purchased)/(Total number of transactions)

Support(Milk)=4/6

Similarly, Support(Milk, Bread)=(Total number of times Milk and Bread is purchased together)/(Total number of transactions)

Support(Milk, Bread)=2/6

Basically, support gives an understanding of how frequently the item or items are purchased.

If the support of a rule is high, that means that the combination of products is sold very frequently.


2. Confidence: How often a second product is purchased, given that the first product was purchased.

Example: Confidence(Milk->Bread) means the purchase of milk is leading to the purchase of Bread.

It is calculated as (How many times Milk and Bread are present together)/(How many times Milk is present in all the transactions)

Confidence(Milk->Bread) = 2/4

This can be interpreted as “If a person purchases Milk then 50% of the times they also purchase Bread”. 

Hence, this is an important rule to act upon, and it would be wise to keep Bread near the Milk tray!

Support Vs Confidence:

Support gives an idea about the frequency of a rule. If support is high, that means the products are sold frequently.

However, the importance of a rule is better judged by Confidence, since it helps us to be sure that the purchase of product-A leads to the purchase of product-B. If the confidence is high, then we can be reasonably sure that if a person purchases product-A, they will purchase product-B as well.

Example: 

Support(Milk->Bread) = 2/6

Confidence(Milk->Bread)=2/4

Here the support is low, but the confidence is high! This is acceptable. 

If the support is high but confidence is low, then it is not acceptable.


3. Lift: It is defined as the ratio of the support of two products being purchased together to the product of the supports of each product purchased independently.

Lift(A->B)= support(A,B)/(support(A) * support(B))

Based on the above formula we can say that:

  • if Lift>1: It means if one product is purchased then the other is likely to be purchased
  • if Lift<1: It means if one product is purchased then the other is NOT likely to be purchased
  • if Lift=1: It means there is no association

Example: Lift(Milk->Bread) = Support(Milk, Bread)/(Support(Milk) * Support(Bread))

Lift(Milk->Bread) = 0.33/(0.66 * 0.33) = 1.51

Hence we can say Milk and Bread are likely to be purchased together.

In general the higher the Lift, the higher the association is between products and the more actionable the rule is.
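These metrics are just counting, so they are easy to verify in a few lines of code. The sketch below reproduces the numbers above from a hypothetical list of six transactions; the exact contents of each basket are assumed, chosen only so that the counts match the example.

```python
# Sketch: support, confidence, and lift by simple counting.
# The six baskets below are assumed; they are chosen to match the counts above.
transactions = [
    {"Milk"},
    {"Milk", "Bread"},
    {"Milk", "Bread", "Saffron"},
    {"Cold Drink", "Wafer"},
    {"Cold Drink"},
    {"Milk", "Cold Drink", "Wafer"},
]
n = len(transactions)

def support(*items):
    """Fraction of transactions that contain all the given items."""
    return sum(set(items) <= basket for basket in transactions) / n

support_milk = support("Milk")                 # 4/6
support_bread = support("Bread")               # 2/6
support_milk_bread = support("Milk", "Bread")  # 2/6

confidence = support_milk_bread / support_milk              # 0.5
lift = support_milk_bread / (support_milk * support_bread)  # 1.5

print(support_milk, support_milk_bread, confidence, lift)
```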


Now we have an understanding of the building blocks of association rule mining. Let’s discuss the algorithms one by one.


Q How does the Apriori algorithm work?

Apriori Summary using R and Python

This algorithm calculates the support, confidence, and lift for all possible combinations of products that satisfy a minimum threshold of support and confidence, hence reducing the computation required. Otherwise, there would be too many combinations.

This activity is performed in two major steps:

  1. Find out the frequent combination of items called “itemsets” which satisfy the minimum support threshold.
  2. Generate association rules from frequent itemsets based on the minimum confidence threshold.

The above style of finding association rules is based on the breadth-first search strategy, hence, it needs to scan the database multiple times. Due to this, the Apriori algorithm is slow. It is often criticized for not being able to handle very large datasets efficiently.


Step-1: Finding the frequent itemsets: It is an iterative process, with pruning involved in each step.

We start with one item at a time and keep increasing the items in the itemset.

  1. items=1, start finding itemsets of one product each. Compute the support for each itemset and remove those which do not satisfy the minimum support threshold.
  2. items=2, start finding combinations of 2 products as an itemset. Compute the support for each itemset and remove those which do not satisfy the minimum support threshold.
  3. items=3, start finding combinations of 3 products as an itemset. Compute the support for each itemset and remove those which do not satisfy the minimum support threshold.
  4. Keep repeating the process until the maximum possible combinations are covered.

Step-2: Creating rules from itemsets: It is an iterative process, with elimination involved in each step.

Example: If an itemset contains 4 elements {Milk, Bread, Cold Drink, Wafer} then follow the below steps by starting with only one item in the RHS and keep decreasing the elements in LHS to formulate and evaluate new rules.

  1. Find the confidence of rule {Milk,Bread,Cold Drink -> Wafer}, if it is more than the minimum confidence threshold then keep it otherwise reject.
  2. Find the confidence of rule {Milk,Bread -> Cold Drink,Wafer}, if it is more than the minimum confidence threshold then keep it otherwise reject.
  3. Find the confidence of rule {Milk -> Bread,Cold Drink,Wafer}, if it is more than the minimum confidence threshold then keep it otherwise reject.
  4. Find the confidence of rule {Bread,Cold Drink,Wafer -> Milk}, if it is more than the minimum confidence threshold then keep it otherwise reject.
  5. Find the confidence of rule {Bread,Cold Drink -> Wafer,Milk}, if it is more than the minimum confidence threshold then keep it otherwise reject.
  6. Find the confidence of rule {Bread -> Cold Drink,Wafer,Milk}, if it is more than the minimum confidence threshold then keep it otherwise reject.

So on and so forth… The above steps are repeated for each and every itemset found with the minimum support and confidence level.

In the end, we are left with those rules which satisfy the minimum support and confidence threshold.

Q. How to run Apriori in R?

Q. How to run Apriori in Python?
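A minimal sketch using the mlxtend library (pip install mlxtend); the transactions and thresholds are illustrative assumptions, and the exact association_rules signature may vary slightly between mlxtend versions:

```python
# Minimal Apriori sketch with mlxtend; transactions and thresholds are illustrative
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [
    ["Milk"],
    ["Milk", "Bread"],
    ["Milk", "Bread", "Saffron"],
    ["Cold Drink", "Wafer"],
    ["Cold Drink"],
    ["Milk", "Cold Drink", "Wafer"],
]

# One-hot encode the transactions into a True/False DataFrame
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Step 1: frequent itemsets above a minimum support threshold
frequent_itemsets = apriori(df, min_support=0.3, use_colnames=True)

# Step 2: association rules above a minimum confidence threshold
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.5)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```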


Q How does the Eclat algorithm work?

Eclat algorithm summary using R and Python

ECLAT stands for Equivalence CLAss Transformation! The basic idea is the same as the Apriori algorithm, but ECLAT addresses and improves two major issues in Apriori.

  1. ECLAT uses Depth First Search while scanning the transactions database. This is done by defining a "vertical database" based on the "tid-list", which is nothing but the list of transactions in which an itemset (a combination of items) was purchased.
  2. This approach reduces the number of database scans needed to find all the rules. Hence, we can use ECLAT on large databases.
  3. ECLAT uses the "lattice-theoretic" approach. It breaks down the largest combination of items (the biggest lattice) into smaller subsets (sublattices) and processes each smaller subset to find association rules. Hence it can be executed in parallel, which reduces the time taken.
  4. ECLAT uses a bottom-up, top-down, or hybrid search approach to find the sublattices.

ECLAT generates association rules in two major steps which are the same as used in apriori.

  1. Find out the frequent combination of items called “itemsets” which satisfy the minimum support threshold.
  2. Generate association rules from frequent itemsets based on the minimum confidence threshold.

The Key difference is HOW ECLAT finds the frequent “itemsets”? Here ECLAT is more efficient than apriori.


Step-1: Find out the frequent combination of items called “itemsets”

  • This begins by finding those combinations of items (itemsets) which contain the maximum number of items among all other itemsets and have the maximum frequency in the transactions database. These are known as "maximal frequent itemsets".
  • All other itemsets are subsets of these itemsets. Hence these are the starting points (Starting Big Lattice).
  • Keep making smaller combinations of itemsets from “maximal frequent itemsets”
  • Example: If a maximal frequent itemset has 4 items {Milk,Bread,Cold Drink,Wafer} then the algorithm will start creating smaller subsets which satisfy the minimum support threshold, e.g. {Milk,Bread,Cold Drink}, {Milk,Bread,Wafer}, {Bread,Cold Drink,Wafer}, {Milk,Bread}, {Milk}, etc.

Step-2: Generate Association rules using the minimum confidence threshold.

  • For each of the itemsets calculate the confidence and keep eliminating those rules which do not satisfy the minimum confidence threshold.
  • Example: If an itemset contains 4 elements {Milk, Bread, Cold Drink, Wafer} then follow the below steps by starting with only one item in the RHS and keep decreasing the elements in LHS to formulate and evaluate new rules.
    • Find the confidence of rule {Milk,Bread,Cold Drink -> Wafer}, if it is more than the minimum confidence threshold then keep it otherwise reject.
    • Find the confidence of rule {Milk,Bread -> Cold Drink,Wafer}, if it is more than the minimum confidence threshold then keep it otherwise reject.
    • Find the confidence of rule {Milk -> Bread,Cold Drink,Wafer}, if it is more than the minimum confidence threshold then keep it otherwise reject.
    • Find the confidence of rule {Bread,Cold Drink,Wafer -> Milk}, if it is more than the minimum confidence threshold then keep it otherwise reject.
    • Find the confidence of rule {Bread,Cold Drink -> Wafer,Milk}, if it is more than the minimum confidence threshold then keep it otherwise reject.
    • Find the confidence of rule {Bread -> Cold Drink,Wafer,Milk}, if it is more than the minimum confidence threshold then keep it otherwise reject.
    • So on and so forth…

The above steps are repeated for each and every itemset found with the minimum support and confidence level.

In the end, we are left with those rules which satisfy the minimum support and confidence threshold.

Q. How to run Eclat in R?

Q. How to run Eclat in Python?
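scikit-learn and mlxtend do not ship an ECLAT implementation that I can point to with certainty, so the sketch below illustrates Step-1 (the vertical tid-list idea) directly in plain Python on an assumed set of six transactions; it is a teaching sketch, not a production implementation:

```python
# ECLAT-style sketch: frequent itemsets via tid-list intersections (illustrative data)
from itertools import combinations

transactions = [
    {"Milk"},
    {"Milk", "Bread"},
    {"Milk", "Bread", "Saffron"},
    {"Cold Drink", "Wafer"},
    {"Cold Drink"},
    {"Milk", "Cold Drink", "Wafer"},
]
min_support_count = 2

# Vertical database: item -> set of transaction ids (the "tid-list")
tidlists = {}
for tid, basket in enumerate(transactions):
    for item in basket:
        tidlists.setdefault(item, set()).add(tid)

# Frequent single items
frequent = {frozenset([item]): tids for item, tids in tidlists.items()
            if len(tids) >= min_support_count}

# Grow itemsets by intersecting tid-lists: the support of a combined itemset
# is simply the size of the intersection of the tid-lists
current = dict(frequent)
while current:
    next_level = {}
    for (a, tids_a), (b, tids_b) in combinations(current.items(), 2):
        candidate, tids = a | b, tids_a & tids_b
        if len(candidate) == len(a) + 1 and len(tids) >= min_support_count:
            next_level[candidate] = tids
    frequent.update(next_level)
    current = next_level

for itemset, tids in frequent.items():
    print(set(itemset), "support =", len(tids) / len(transactions))
```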


Q How does the FP-Growth algorithm work?

FP-Growth summary using R and Python

FP-Growth stands for Frequent Patterns Growth. This algorithm is faster than apriori but not suitable for large datasets due to high memory requirements.

Like Apriori and ECLAT, FP-Growth also generates rules in two steps:

  1. Find out the frequent combination of items called “itemsets” which satisfy the minimum support threshold.
  2. Generate association rules from frequent itemsets based on the minimum confidence threshold.

The main difference between apriori and FP-Growth is the way FP-Growth FINDS frequent itemsets.

FP-Growth algorithm uses a special tree data structure known as “FP-Tree” that stores the frequent patterns at one place

This tree helps to reduce the need for scanning databases dramatically! In only two scans, all the frequent itemsets are found.

Once all the frequent itemsets are found, then the rules are generated using the tree.


Let’s examine how FP-Growth works using below sample transactions dataset:

Sample Transactions in a grocery store

Step-1. Find out the frequent combination of items called “itemsets”

1. Find out the support of every item in the transactions dataset and arrange them in decreasing order; if two items have the same support, then arrange them alphabetically.

Ordering items based on support in decreasing order

Choose only those items which are above a minimum support threshold.

for example: {Milk:4, Cold Drink:3, Bread:2, Wafer:2, Saffron:1}

Now if the minimum threshold=2 is applied, then Saffron will get eliminated and we will get only 

{Milk:4, Cold Drink:3, Bread:2, Wafer:2}

This simple elimination reduces a lot of computing effort! We will now construct the FP-Tree for only those remaining items.

Remaining items after Min Support threshold in FP-Growth

2. Arrange the items present in every transaction in decreasing order of support

For example, {Cold Drink, Milk, Wafer} will be arranged like {Milk:4, Cold Drink:3,  Wafer:2}

Milk gets the first position because it has maximum support and Wafer gets the last position since it has the least support value.

3. Create the FP-Tree by inserting one transaction at a time

  • Transaction-1: Milk:4 gets inserted and frequency incremented by 1
  • Transaction-2: Milk:4, Bread:2 get inserted and their frequencies are incremented by 1; hence Milk has frequency=2 and Bread has frequency=1
  • Transaction-3: Milk:4, Bread:2 gets inserted and frequency is incremented by 1, hence, Milk has frequency 3 now and Bread has frequency=2
  • Transaction-4: Cold Drink:3, Wafer:2 gets inserted. This creates a new leg from the root, with frequency as 1 for both items
  • Transaction-5: Cold Drink:3 gets inserted in the same leg, the frequency is updated as 2
  • Transaction-6: Milk:4, Cold Drink:3, Wafer:2 gets inserted, the frequency is incremented by 1.

4. Read the FP-Tree to find the path(known as a prefix) of each item.

This helps to count the number of ways to reach an item in the tree.

To reach Bread, only one path is possible, and it is repeated 2 times in the transactions: Milk->Bread

To reach Milk we start from the root node hence, it is an empty path.

Finding Frequent Itemsets from FP-Tree

5. Generate Frequent itemsets

Combine the item with its path and make sure all combinations are taken into account

For example: if we reach Saffron via Milk->Bread, then the frequent itemsets will be

{Saffron,Milk}, {Saffron,Bread}, {Milk,Saffron,Bread}

Step-2: Generate Association rules using the minimum confidence threshold.

1. For each of the itemsets calculate the confidence and keep eliminating those rules which do not satisfy the minimum confidence threshold.

Example: If an itemset contains 4 elements {Milk, Bread, Cold Drink, Wafer} then follow the below steps by starting with only one item in the RHS and keep decreasing the elements in the LHS to formulate and evaluate new rules.

  • Find the confidence of rule {Milk,Bread,Cold Drink -> Wafer}, if it is more than the minimum confidence threshold then keep it otherwise reject.
  • Find the confidence of rule {Milk,Bread -> Cold Drink,Wafer}, if it is more than the minimum confidence threshold then keep it otherwise reject.
  • Find the confidence of rule {Milk -> Bread,Cold Drink,Wafer}, if it is more than the minimum confidence threshold then keep it otherwise reject.
  • Find the confidence of rule {Bread,Cold Drink,Wafer -> Milk}, if it is more than the minimum confidence threshold then keep it otherwise reject.
  • Find the confidence of rule {Bread,Cold Drink -> Wafer,Milk}, if it is more than the minimum confidence threshold then keep it otherwise reject.
  • Find the confidence of rule {Bread -> Cold Drink,Wafer,Milk}, if it is more than the minimum confidence threshold then keep it otherwise reject.
  • So on and so forth…

The above steps are repeated for each and every itemset found with the minimum support and confidence level.

In the end, we are left with those rules which satisfy the minimum support and confidence threshold.

Q. How to run FP-Growth in R?

Q. How to run FP-Growth in Python?
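A minimal sketch using mlxtend's fpgrowth function (pip install mlxtend); the transactions and the minimum support of 2 out of 6 are illustrative assumptions matching the example above:

```python
# Minimal FP-Growth sketch with mlxtend; transactions and thresholds are illustrative
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

transactions = [
    ["Milk"],
    ["Milk", "Bread"],
    ["Milk", "Bread", "Saffron"],
    ["Cold Drink", "Wafer"],
    ["Cold Drink"],
    ["Milk", "Cold Drink", "Wafer"],
]

# One-hot encode the transactions into a True/False DataFrame
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Step 1: frequent itemsets found via the FP-Tree (minimum support of 2 out of 6)
frequent_itemsets = fpgrowth(df, min_support=2 / 6, use_colnames=True)
print(frequent_itemsets)

# Step 2 (rule generation by confidence) can then be done exactly as in the
# Apriori example, e.g. with mlxtend's association_rules function.
```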


Conclusion

In this post, I have explained some of the important algorithms used for Unsupervised ML. With these, you will be able to solve a wide variety of problems falling under Unsupervised ML.

There are many more algorithms which you can explore further here!

Let me know in case you want an explanation for any other Unsupervised ML algorithm. I will include it in this post.

In the next post, I will talk about the interview questions related to reinforcement learning.

All the best for that interview!

Farukh is an innovator in solving industry problems using Artificial intelligence. His expertise is backed with 10 years of industry experience. Being a senior data scientist he is responsible for designing the AI/ML solution to provide maximum gains for the clients. As a thought leader, his focus is on solving the key business problems of the CPG Industry. He has worked across different domains like Telecom, Insurance, and Logistics. He has worked with global tech leaders including Infosys, IBM, and Persistent systems. His passion to teach got him to start this blog!
