How to visualize data distribution of a categorical variable in Python

Bar charts can be used in many ways. One common use is to visualize the distribution of a categorical variable: the X-axis shows the unique category values and the Y-axis shows the frequency of each value.

In the data below, one column (APPROVE_LOAN) is categorical. To understand how its values are distributed, you can use a bar chart.
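As a minimal sketch of this idea, the snippet below counts the values of the column with pandas and draws one bar per category with matplotlib. Only the column name APPROVE_LOAN comes from this article; the DataFrame name loan_data and the values in it are made up for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical sample data; only the column name APPROVE_LOAN is from the article.
loan_data = pd.DataFrame({
    "APPROVE_LOAN": ["Yes", "No", "Yes", "Yes", "No",
                     "Yes", "Yes", "No", "Yes", "Yes"]
})

# Count how many times each unique category value occurs.
category_counts = loan_data["APPROVE_LOAN"].value_counts()

# One bar per category: category values on the X-axis, frequency on the Y-axis.
fig, ax = plt.subplots()
category_counts.plot(kind="bar", ax=ax, rot=0)
ax.set_xlabel("APPROVE_LOAN")
ax.set_ylabel("Frequency")
ax.set_title("Distribution of APPROVE_LOAN")
```

In a notebook the chart is displayed automatically; in a plain script, call plt.show() at the end to open the plot window.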

Sample Output:

Bar chart for a single column in python

A bar chart for a single categorical column gives the following information:

  1. The central tendency of the data (the mode, i.e. the most frequent value)
  2. Any imbalance in the data, such as values that occur only a few times
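The same two pieces of information can also be read off numerically. The sketch below uses made-up values, and the 15% cut-off for calling a value "rare" is an arbitrary assumption chosen for illustration, not a rule from this article.

```python
import pandas as pd

# Hypothetical categorical values, invented for illustration.
values = pd.Series(["Yes", "No", "Yes", "Yes", "No", "Yes", "Maybe"])

# 1. Central tendency: the mode is the most frequent value (the tallest bar).
mode_value = values.mode()[0]

# 2. Imbalance: share of rows per category, from most to least frequent.
shares = values.value_counts(normalize=True)

# Assumed threshold: treat a value as rare if it covers under 15% of the rows.
rare_threshold = 0.15
rare_values = shares[shares < rare_threshold].index.tolist()

print("Mode:", mode_value)
print("Rare values:", rare_values)
```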


What is the ideal output from a bar chart?

The ideal output is a chart in which every bar has the same height (frequency). This means each unique value occurs an equal number of times, so the data has enough examples of each category to learn from. Such data is known as balanced data.

Consider the example below, where the “Yes” and “No” cases each occur 10 times. The ML algorithm therefore has the same number of examples of both cases to learn from.
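A sketch of this balanced case is shown here. The data is constructed by hand to match the 10/10 split described above, and the variable names are hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hand-built data matching the example: 10 "Yes" cases and 10 "No" cases.
balanced_data = pd.DataFrame({"APPROVE_LOAN": ["Yes"] * 10 + ["No"] * 10})

counts = balanced_data["APPROVE_LOAN"].value_counts()

# The column is perfectly balanced when every category has the same count,
# i.e. the counts contain only one distinct number.
is_balanced = counts.nunique() == 1

# Both bars should come out at the same height.
fig, ax = plt.subplots()
counts.plot(kind="bar", ax=ax, rot=0)
ax.set_xlabel("APPROVE_LOAN")
ax.set_ylabel("Frequency")
```

Real data is rarely this even, so in practice the check is usually a judgment about how far the bar heights differ rather than a strict equality test.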

Sample Output:

Bar chart for balanced data in python

What to do for imbalanced categorical data?

If the bar chart shows many unique values in a column with only one of them dominating, the data is imbalanced. Such a column needs outlier treatment, which here means grouping together the values that occur with low frequency.

For example, in the scenario below, the category “C” dominates and every other value occurs only once.

Bar chart with imbalanced data

In this form, the data is not fit for machine learning, because the rare categories offer too few examples to learn from. To make it useful, we can group the values “A”, “B” and “D” into a single category, say “ABD”. This operation improves the distribution of the data, as shown below, and the column can then be used for machine learning. In Python, this can be done with the replace() function of the pandas DataFrame.

An important point to note: to combine values, you need some domain knowledge about the data so that you can judge whether the grouping is sensible.
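The grouping described above could look roughly like this. The column names CATEGORY and CATEGORY_GROUPED and the row counts are assumptions made for illustration; the sketch calls replace() on the single column (a pandas Series), and the same method is also available on a whole DataFrame.

```python
import pandas as pd

# Hypothetical data mirroring the scenario: "C" dominates,
# while "A", "B" and "D" each occur only once.
data = pd.DataFrame({"CATEGORY": ["A", "B", "D"] + ["C"] * 7})
counts_before = data["CATEGORY"].value_counts()

# Map each low-frequency value to the combined label "ABD".
# Values not listed here (such as "C") are left unchanged.
data["CATEGORY_GROUPED"] = data["CATEGORY"].replace(["A", "B", "D"], "ABD")
counts_after = data["CATEGORY_GROUPED"].value_counts()

print(counts_before)
print(counts_after)
```

Writing the result to a new column keeps the original values available, which makes it easy to undo or revise the grouping if it later turns out not to make sense for the domain.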

Improved data distribution after grouping a few categories

Author Details
Lead Data Scientist
Farukh is an innovator in solving industry problems using Artificial intelligence. His expertise is backed with 10 years of industry experience. Being a senior data scientist he is responsible for designing the AI/ML solution to provide maximum gains for the clients. As a thought leader, his focus is on solving the key business problems of the CPG Industry. He has worked across different domains like Telecom, Insurance, and Logistics. He has worked with global tech leaders including Infosys, IBM, and Persistent systems. His passion to teach inspired him to create this website!
