Bar charts can be used in many ways. One common use is to visualize the distribution of a categorical variable: the X-axis shows the unique category values and the Y-axis shows the frequency of each value.
In the data below, one column (APPROVE_LOAN) is categorical. To understand how its values are distributed, you can use a bar chart.
```python
import pandas as pd
import matplotlib.pyplot as plt

ColumnNames = ['CIBIL', 'AGE', 'SALARY', 'APPROVE_LOAN']
DataValues = [[480, 28, 610000, 'Yes'],
              [480, 42, 140000, 'No'],
              [480, 29, 420000, 'No'],
              [490, 30, 420000, 'No'],
              [500, 27, 420000, 'No'],
              [510, 34, 190000, 'No'],
              [550, 24, 330000, 'Yes'],
              [560, 34, 160000, 'Yes'],
              [560, 25, 300000, 'Yes'],
              [570, 34, 450000, 'Yes'],
              [590, 30, 140000, 'Yes'],
              [600, 33, 600000, 'Yes'],
              [600, 22, 400000, 'Yes'],
              [600, 25, 490000, 'Yes'],
              [610, 32, 120000, 'Yes'],
              [630, 29, 360000, 'Yes'],
              [630, 30, 480000, 'Yes'],
              [660, 29, 460000, 'Yes'],
              [700, 32, 470000, 'Yes'],
              [740, 28, 400000, 'Yes']]

# Create the Data Frame
LoanData = pd.DataFrame(data=DataValues, columns=ColumnNames)
print(LoanData.head())

# Count the occurrences of each unique category
GroupedData = LoanData.groupby(by='APPROVE_LOAN').size()
print(GroupedData)

# Generate a bar chart for a single column
# (in a Jupyter notebook, run %matplotlib inline first to render the chart inline)
GroupedData.plot.bar()
plt.show()
```
Sample Output: a bar chart with two bars, one for “No” (5 rows) and one for “Yes” (15 rows).

A bar chart for a single categorical column tells you:
- The central tendency of the data (the mode, i.e. the most frequent value)
- Any imbalance in the data, such as a value that occurs only a few times
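The two points above can also be checked numerically. The following is a minimal sketch, not part of the original example: the series is a hypothetical stand-in with 15 “Yes” and 5 “No” values, and the 30% cut-off for calling a value rare is an assumption to tune per dataset.

```python
import pandas as pd

# Hypothetical stand-in column, used only for illustration
approve_loan = pd.Series(['Yes'] * 15 + ['No'] * 5, name='APPROVE_LOAN')

# Frequency of each category: the same numbers a bar chart displays
counts = approve_loan.value_counts()
print(counts)

# Mode: the most frequent category
mode_value = approve_loan.mode()[0]
print('Mode:', mode_value)

# Share of rows per category; a very small share signals imbalance
shares = approve_loan.value_counts(normalize=True)
print(shares)

# Flag categories whose share falls below the assumed threshold
rare_threshold = 0.30  # assumption: choose a value that suits your data
rare_values = shares[shares < rare_threshold].index.tolist()
print('Rare categories:', rare_values)
```

Here the mode is “Yes”, and “No” is flagged because it makes up only a quarter of the rows.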
What is the ideal output from a bar chart?
The ideal output is a chart in which every bar has the same height (frequency). This means each unique value occurs an equal number of times, so the data contains enough examples of each value to learn from. Such data is called balanced data.
Consider the example below, where the “Yes” and “No” cases each occur 10 times. The ML algorithm therefore has the same number of examples of both cases to learn from.
```python
# Creating a sample balanced data frame
import pandas as pd
import matplotlib.pyplot as plt

ColumnNames = ['CIBIL', 'AGE', 'SALARY', 'APPROVE_LOAN']
DataValues = [[480, 28, 610000, 'No'],
              [480, 42, 140000, 'No'],
              [480, 29, 420000, 'No'],
              [490, 30, 420000, 'No'],
              [500, 27, 420000, 'No'],
              [510, 34, 190000, 'No'],
              [550, 24, 330000, 'No'],
              [560, 34, 160000, 'No'],
              [560, 25, 300000, 'No'],
              [570, 34, 450000, 'No'],
              [590, 30, 140000, 'Yes'],
              [600, 33, 600000, 'Yes'],
              [600, 22, 400000, 'Yes'],
              [600, 25, 490000, 'Yes'],
              [610, 32, 120000, 'Yes'],
              [630, 29, 360000, 'Yes'],
              [630, 30, 480000, 'Yes'],
              [660, 29, 460000, 'Yes'],
              [700, 32, 470000, 'Yes'],
              [740, 28, 400000, 'Yes']]

# Create the Data Frame
LoanData = pd.DataFrame(data=DataValues, columns=ColumnNames)
print(LoanData.head())

# Count the occurrences of each unique category
GroupedData = LoanData.groupby(by='APPROVE_LOAN').size()
print(GroupedData)

# Generate a bar chart for a single column
# (in a Jupyter notebook, run %matplotlib inline first to render the chart inline)
GroupedData.plot.bar()
plt.show()
```
Sample Output: a bar chart with two bars of equal height, “No” and “Yes” (10 rows each).

What to do for imbalanced categorical data?
If the bar chart shows that a column has many unique values and one of them dominates, the data is imbalanced. Such a column needs outlier treatment: group together the values that occur with low frequency.
For example, in the scenario below, the category “C” dominates and the other values each occur only once.

Data like this is poorly suited to machine learning, because the algorithm sees too few examples of the rare values. To make it useful, we can group the values “A”, “B” and “D” together into a single category, say “ABD”. This improves the distribution of the data, as shown below, so that it can be used for machine learning. In Python, this can be done using the replace() function of the pandas data frame.
Note that to combine values sensibly, you need some domain knowledge about the data, so that you can judge whether the grouping makes sense.
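The grouping described above can be sketched as follows. This is an illustrative example under stated assumptions: the column name CATEGORY, the new column CATEGORY_GROUPED, and the count of ten “C” rows are hypothetical, since the original data behind the chart is not given.

```python
import pandas as pd

# Hypothetical column: "C" dominates, while "A", "B" and "D" occur once each
data = pd.DataFrame({'CATEGORY': ['A', 'B', 'D'] + ['C'] * 10})
print(data['CATEGORY'].value_counts())

# Map each low-frequency value to the combined label "ABD"
data['CATEGORY_GROUPED'] = data['CATEGORY'].replace(
    {'A': 'ABD', 'B': 'ABD', 'D': 'ABD'}
)

# Frequencies after grouping: only "C" and "ABD" remain
grouped_counts = data['CATEGORY_GROUPED'].value_counts()
print(grouped_counts)
```

Writing the result to a new column keeps the original values available, which helps if the grouping later turns out not to make sense for the domain. Calling grouped_counts.plot.bar() would draw the bar chart of the regrouped column.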

