How to visualize data distribution of a categorical variable in Python

Bar charts can be used in many ways. One common use is to visualize the data distribution of a categorical variable: the X-axis shows the unique category values and the Y-axis shows the frequency of each value.

In the data below, one column (APPROVE_LOAN) is categorical. To understand how its values are distributed, you can use a bar chart.

import pandas as pd
ColumnNames=['CIBIL','AGE', 'SALARY', 'APPROVE_LOAN']
DataValues=[[480, 28, 610000, 'Yes'],
             [480, 42, 140000, 'No'],
             [480, 29, 420000, 'No'],
             [490, 30, 420000, 'No'],
             [500, 27, 420000, 'No'],
             [510, 34, 190000, 'No'],
             [550, 24, 330000, 'Yes'],
             [560, 34, 160000, 'Yes'],
             [560, 25, 300000, 'Yes'],
             [570, 34, 450000, 'Yes'],
             [590, 30, 140000, 'Yes'],
             [600, 33, 600000, 'Yes'],
             [600, 22, 400000, 'Yes'],
             [600, 25, 490000, 'Yes'],
             [610, 32, 120000, 'Yes'],
             [630, 29, 360000, 'Yes'],
             [630, 30, 480000, 'Yes'],
             [660, 29, 460000, 'Yes'],
             [700, 32, 470000, 'Yes'],
             [740, 28, 400000, 'Yes']]

# Create the Data Frame
LoanData=pd.DataFrame(data=DataValues,columns=ColumnNames)
print(LoanData.head())

###############################################
# Counting the occurrences of each unique category
GroupedData=LoanData.groupby(by='APPROVE_LOAN').size()
print(GroupedData)

# Generating a bar chart for a single column
# (%matplotlib inline is a Jupyter/IPython magic; omit it in a plain Python script)
%matplotlib inline
GroupedData.plot.bar()

Sample Output:

A bar chart for a single categorical column gives the following information:

The central tendency of the data (the mode, i.e. the most frequent value)
Any imbalance in the data, such as a value that occurs only a few times
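The same two facts can also be read off numerically. The following is a minimal sketch, not part of the original example: it assumes a stand-in Series (`approve_loan`, a hypothetical name) with the same 15 "Yes" and 5 "No" values as the APPROVE_LOAN column above, and uses the pandas `value_counts()` and `mode()` methods.

```python
import pandas as pd

# Hypothetical stand-in for the APPROVE_LOAN column above: 15 "Yes" and 5 "No"
approve_loan = pd.Series(['Yes'] * 15 + ['No'] * 5, name='APPROVE_LOAN')

# Frequency of each unique category (what the bar heights represent)
counts = approve_loan.value_counts()
print(counts)

# Share of each category; a very small share hints at imbalance
shares = approve_loan.value_counts(normalize=True)
print(shares)

# Central tendency of a categorical column: the most frequent value
mode_value = approve_loan.mode()[0]
print('Mode:', mode_value)
```

Here "Yes" is the mode and accounts for 75% of the rows, which is the same picture the bar chart gives visually.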

What is the ideal output from a bar chart?

The ideal output would be that each bar is of the same height (frequency). This means each unique value is present an equal number of times, so the data has enough examples of each category to learn from. Such data is known as balanced data.

Consider the example below, where the “Yes” and “No” cases are present 10 times each. The ML algorithm therefore has the same number of examples of both cases to learn from.

# Creating a sample balanced data frame
import pandas as pd
ColumnNames=['CIBIL','AGE', 'SALARY', 'APPROVE_LOAN']
DataValues=[[480, 28, 610000, 'No'],
             [480, 42, 140000, 'No'],
             [480, 29, 420000, 'No'],
             [490, 30, 420000, 'No'],
             [500, 27, 420000, 'No'],
             [510, 34, 190000, 'No'],
             [550, 24, 330000, 'No'],
             [560, 34, 160000, 'No'],
             [560, 25, 300000, 'No'],
             [570, 34, 450000, 'No'],
             [590, 30, 140000, 'Yes'],
             [600, 33, 600000, 'Yes'],
             [600, 22, 400000, 'Yes'],
             [600, 25, 490000, 'Yes'],
             [610, 32, 120000, 'Yes'],
             [630, 29, 360000, 'Yes'],
             [630, 30, 480000, 'Yes'],
             [660, 29, 460000, 'Yes'],
             [700, 32, 470000, 'Yes'],
             [740, 28, 400000, 'Yes']]

#Create the Data Frame
LoanData=pd.DataFrame(data=DataValues,columns=ColumnNames)
print(LoanData.head())
####################################

# Counting the occurrences of each unique category
GroupedData=LoanData.groupby(by='APPROVE_LOAN').size()
print(GroupedData)

# Generating a bar chart for a single column
# (%matplotlib inline is a Jupyter/IPython magic; omit it in a plain Python script)
%matplotlib inline
GroupedData.plot.bar()

Sample Output:

What to do for imbalanced categorical data?

If the bar chart shows that a column has many unique values and only one of them dominates, the data is imbalanced. Such a column needs outlier treatment, which here means grouping the values that occur with low frequency.

For example, in the scenario below, the category “C” dominates and the other values are present only once.

Data distributed like this is poorly suited to machine learning, because the rare categories offer too few examples to learn from. To make it useful, we can group the values “A”, “B” and “D” into a single category, say “ABD”. This improves the distribution of the data, as shown below, so the column can be used for machine learning. In Python, this can be done with the replace() function of a pandas data frame.

An important point: to combine values, you need some domain knowledge about the data so that you can judge whether the grouping is sensible.

Improved data distribution after grouping a few categories
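The grouping described above can be sketched in a few lines. This is an illustrative example under stated assumptions, not the author's code: the data frame `df`, the column names `CATEGORY` and `CATEGORY_GROUPED`, and the counts (seven "C" rows and one each of "A", "B" and "D") are all hypothetical. The only library call it relies on beyond `value_counts()` is pandas `replace()` with a dictionary mapping old values to new ones.

```python
import pandas as pd

# Hypothetical column: "C" dominates, "A", "B" and "D" appear once each
df = pd.DataFrame({'CATEGORY': ['C'] * 7 + ['A', 'B', 'D']})

# Distribution before grouping
before = df['CATEGORY'].value_counts()
print(before)

# Map each low-frequency value to the combined label "ABD";
# values not listed in the mapping (here "C") are left unchanged
df['CATEGORY_GROUPED'] = df['CATEGORY'].replace(
    {'A': 'ABD', 'B': 'ABD', 'D': 'ABD'}
)

# Distribution after grouping: two categories instead of four
after = df['CATEGORY_GROUPED'].value_counts()
print(after)
```

Writing the result to a new column keeps the original values available, which helps if the grouping later turns out not to make sense for the domain. Calling `.plot.bar()` on `after`, as in the earlier examples, would draw the regrouped distribution.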

Author Details

Farukh Hashmi

Lead Data Scientist

Farukh is an innovator in solving industry problems using Artificial intelligence. His expertise is backed with 10 years of industry experience. Being a senior data scientist he is responsible for designing the AI/ML solution to provide maximum gains for the clients. As a thought leader, his focus is on solving the key business problems of the CPG Industry. He has worked across different domains like Telecom, Insurance, and Logistics. He has worked with global tech leaders including Infosys, IBM, and Persistent systems. His passion to teach inspired him to create this website!

https://thinkingneuron.com/

thinkingneuron@gmail.com
