How to visualize the relationship between two categorical variables in Python

This situation arises in classification problems. Here the target variable is categorical, while the predictors can be either continuous or categorical. When a predictor is also categorical, you can use a grouped bar chart to visualize the correlation between the two variables.

Consider the example below, where the target variable is “APPROVE_LOAN” and one of the predictors is “GENDER”. To understand whether gender has an effect on loan approval, you plot a grouped bar chart.

No-Correlation example

# Creating a sample data frame
import pandas as pd
ColumnNames=['CIBIL','AGE','GENDER' ,'SALARY', 'APPROVE_LOAN']
DataValues=[ [480, 28, 'M', 610000, 'Yes'],
             [480, 42, 'M',140000, 'No'],
             [480, 29, 'F',420000, 'No'],
             [490, 30, 'M',420000, 'No'],
             [500, 27, 'M',420000, 'No'],
             [510, 34, 'F',190000, 'No'],
             [550, 24, 'M',330000, 'Yes'],
             [560, 34, 'M',160000, 'Yes'],
             [560, 25, 'F',300000, 'Yes'],
             [570, 34, 'M',450000, 'Yes'],
             [590, 30, 'F',140000, 'Yes'],
             [600, 33, 'M',600000, 'Yes'],
             [600, 22, 'M',400000, 'Yes'],
             [600, 25, 'F',490000, 'Yes'],
             [610, 32, 'M',120000, 'Yes'],
             [630, 29, 'F',360000, 'Yes'],
             [630, 30, 'M',480000, 'Yes'],
             [660, 29, 'F',460000, 'Yes'],
             [700, 32, 'M',470000, 'Yes'],
             [740, 28, 'M',400000, 'Yes']]
 
#Create the Data Frame
LoanData=pd.DataFrame(data=DataValues,columns=ColumnNames)
print(LoanData.head())
#################################################
# Cross tabulation between GENDER and APPROVE_LOAN
CrosstabResult=pd.crosstab(index=LoanData['GENDER'],columns=LoanData['APPROVE_LOAN'])
print(CrosstabResult)

# Grouped bar chart between GENDER and APPROVE_LOAN
# The next line is only needed in a Jupyter notebook; keep it free of trailing comments
%matplotlib inline
CrosstabResult.plot.bar()

Sample Output

Visualizing relationship between two categorical variables using a grouped bar chart

If the bars for the category “M” follow a similar pattern to the bars for the category “F”, that is, a similar ratio of “Yes” to “No”, then you can say that GENDER and APPROVE_LOAN are NOT correlated.

The reason is simple. If the pattern is similar, then changing the gender tells us nothing about whether a loan is more or less likely to be approved: the ratio of approval to non-approval is about the same for both genders. In this sample, 10 of 13 male applicants (about 77%) and 5 of 7 female applicants (about 71%) were approved.

If the ratio between the grouped bars differs from one category to the next, then the variables are correlated with each other.
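Because the two groups differ in size (13 male and 7 female applicants in the sample above), raw bar heights can be hard to compare by eye. As a minimal sketch, assuming only the two columns involved are needed, the snippet below rebuilds the GENDER and APPROVE_LOAN counts of the no-correlation sample and uses the normalize='index' option of pd.crosstab to convert each row into proportions. The name RowShare is illustrative and does not appear in the original code.

```python
import pandas as pd

# Rebuild only the two relevant columns with the same counts as the
# no-correlation sample: M -> 10 Yes / 3 No, F -> 5 Yes / 2 No
gender = ['M'] * 13 + ['F'] * 7
approve = ['Yes'] * 10 + ['No'] * 3 + ['Yes'] * 5 + ['No'] * 2
LoanData = pd.DataFrame({'GENDER': gender, 'APPROVE_LOAN': approve})

# normalize='index' divides each row by its row total,
# so every GENDER row sums to 1
RowShare = pd.crosstab(index=LoanData['GENDER'],
                       columns=LoanData['APPROVE_LOAN'],
                       normalize='index')
print(RowShare.round(2))
```

Calling RowShare.plot.bar() on this table would draw the same grouped bar chart on a 0 to 1 scale, which makes the two groups directly comparable.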

Correlated variables example

Consider another scenario based on the same data, shown below. Here the ratios of approval to non-approval are different for the categories “M” and “F”. Hence, you can say that loan approval depends on gender, and there is a correlation between these two variables.

# Creating a sample data frame
import pandas as pd
ColumnNames=['CIBIL','AGE','GENDER' ,'SALARY', 'APPROVE_LOAN']
DataValues=[ [480, 28, 'M', 610000, 'Yes'],
             [480, 42, 'M',140000, 'No'],
             [480, 29, 'M',420000, 'No'],
             [490, 30, 'M',420000, 'No'],
             [500, 27, 'M',420000, 'No'],
             [510, 34, 'F',190000, 'No'],
             [550, 24, 'M',330000, 'Yes'],
             [560, 34, 'M',160000, 'No'],
             [560, 25, 'F',300000, 'Yes'],
             [570, 34, 'M',450000, 'Yes'],
             [590, 30, 'F',140000, 'Yes'],
             [600, 33, 'F',600000, 'Yes'],
             [600, 22, 'M',400000, 'No'],
             [600, 25, 'F',490000, 'Yes'],
             [610, 32, 'F',120000, 'Yes'],
             [630, 29, 'F',360000, 'Yes'],
             [630, 30, 'F',480000, 'Yes'],
             [660, 29, 'F',460000, 'Yes'],
             [700, 32, 'M',470000, 'Yes'],
             [740, 28, 'M',400000, 'Yes']]
 
#Create the Data Frame
LoanData=pd.DataFrame(data=DataValues,columns=ColumnNames)
print(LoanData.head())
#########################################################
# Cross tabulation between GENDER and APPROVE_LOAN
CrosstabResult=pd.crosstab(index=LoanData['GENDER'],columns=LoanData['APPROVE_LOAN'])
print(CrosstabResult)

# Grouped bar chart between GENDER and APPROVE_LOAN
CrosstabResult.plot.bar(figsize=(7,4), rot=0)

Sample Output

Grouped bar charts showing the correlation between GENDER and APPROVE_LOAN

Now you can see the difference in the ratios! In this sample, 8 of 9 female applicants were approved, while only 5 of 11 male applicants were, which is roughly a 50/50 chance. The approval rate differs by gender. Hence, gender and loan approval are correlated here.
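The comparison can also be reduced to a single approval rate per gender. The following is a minimal sketch, assuming the GENDER and APPROVE_LOAN counts of the correlated sample above (M: 5 Yes and 6 No, F: 8 Yes and 1 No); the name ApprovalRate is illustrative and not part of the original code.

```python
import pandas as pd

# Rebuild only the two relevant columns with the same counts as the
# correlated sample: M -> 5 Yes / 6 No, F -> 8 Yes / 1 No
gender = ['M'] * 11 + ['F'] * 9
approve = ['Yes'] * 5 + ['No'] * 6 + ['Yes'] * 8 + ['No'] * 1
LoanData = pd.DataFrame({'GENDER': gender, 'APPROVE_LOAN': approve})

# Mark approved loans as True/False, then average per gender:
# the mean of a boolean column is the share of True values
ApprovalRate = (LoanData['APPROVE_LOAN'] == 'Yes').groupby(LoanData['GENDER']).mean()
print(ApprovalRate.round(2))
```

A large gap between the two rates corresponds to the visibly different bar patterns in the chart. With only 20 rows, such a gap should be read as a pattern in this sample rather than as a general conclusion.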

Author Details

Farukh Hashmi

Lead Data Scientist

Farukh is an innovator in solving industry problems using Artificial intelligence. His expertise is backed with 10 years of industry experience. Being a senior data scientist he is responsible for designing the AI/ML solution to provide maximum gains for the clients. As a thought leader, his focus is on solving the key business problems of the CPG Industry. He has worked across different domains like Telecom, Insurance, and Logistics. He has worked with global tech leaders including Infosys, IBM, and Persistent systems. His passion to teach inspired him to create this website!

https://thinkingneuron.com/

thinkingneuron@gmail.com
