This situation occurs in classification, where the target variable is categorical and the predictors can be either continuous or categorical. When the predictor is also categorical, you can use a grouped bar chart to visualize the correlation between the two variables.
Consider the example below, where the target variable is “APPROVE_LOAN” and one of the predictors is “GENDER”. To understand whether gender has an effect on loan approval, you plot a grouped bar chart.
No-Correlation example
    # Creating a sample data frame
    import pandas as pd
    import matplotlib.pyplot as plt

    ColumnNames = ['CIBIL', 'AGE', 'GENDER', 'SALARY', 'APPROVE_LOAN']
    DataValues = [
        [480, 28, 'M', 610000, 'Yes'],
        [480, 42, 'M', 140000, 'No'],
        [480, 29, 'F', 420000, 'No'],
        [490, 30, 'M', 420000, 'No'],
        [500, 27, 'M', 420000, 'No'],
        [510, 34, 'F', 190000, 'No'],
        [550, 24, 'M', 330000, 'Yes'],
        [560, 34, 'M', 160000, 'Yes'],
        [560, 25, 'F', 300000, 'Yes'],
        [570, 34, 'M', 450000, 'Yes'],
        [590, 30, 'F', 140000, 'Yes'],
        [600, 33, 'M', 600000, 'Yes'],
        [600, 22, 'M', 400000, 'Yes'],
        [600, 25, 'F', 490000, 'Yes'],
        [610, 32, 'M', 120000, 'Yes'],
        [630, 29, 'F', 360000, 'Yes'],
        [630, 30, 'M', 480000, 'Yes'],
        [660, 29, 'F', 460000, 'Yes'],
        [700, 32, 'M', 470000, 'Yes'],
        [740, 28, 'M', 400000, 'Yes'],
    ]

    # Create the Data Frame
    LoanData = pd.DataFrame(data=DataValues, columns=ColumnNames)
    print(LoanData.head())

    # Cross tabulation between GENDER and APPROVE_LOAN
    CrosstabResult = pd.crosstab(index=LoanData['GENDER'], columns=LoanData['APPROVE_LOAN'])
    print(CrosstabResult)

    # Grouped bar chart between GENDER and APPROVE_LOAN
    # In a Jupyter notebook, run "%matplotlib inline" on its own line first
    CrosstabResult.plot.bar()
    plt.show()  # only needed when running outside a notebook
Sample Output

If the bars for category “M” follow the same pattern as the bars for category “F”, you can say that GENDER and APPROVE_LOAN are NOT correlated. In the sample above, 10 of 13 loans are approved for “M” and 5 of 7 for “F”, roughly 77% versus 71%.
The reason is simple. If the patterns are similar, changing the gender does not tell us whether a loan is more or less likely to be approved, because the ratio of approvals to rejections is about the same for both genders. Compare the proportions within each group, not the absolute bar heights, since the groups can differ in size.
If the relative lengths of the grouped bars differ from one category to the other, then the variables are correlated with each other.
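Since the comparison is about ratios rather than raw counts, it can help to print the shares directly. The snippet below is a minimal sketch, not part of the original example: it rebuilds only the GENDER and APPROVE_LOAN columns of the no-correlation sample, and the name `RowShares` is introduced here for illustration.

```python
import pandas as pd

# GENDER and APPROVE_LOAN columns copied from the no-correlation sample data
LoanData = pd.DataFrame({
    'GENDER': list('MMFMMFMMFMFMMFMFMFMM'),
    'APPROVE_LOAN': ['Yes'] + ['No'] * 5 + ['Yes'] * 14,
})

# normalize='index' divides every row by its row total,
# so the Yes/No shares of each gender add up to 1
RowShares = pd.crosstab(index=LoanData['GENDER'],
                        columns=LoanData['APPROVE_LOAN'],
                        normalize='index')
print(RowShares.round(2))
```

If the "Yes" share is close for both rows, the bars tell the same story for both genders. Calling `RowShares.plot.bar(stacked=True, rot=0)` would draw the same comparison as 100% stacked bars.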
Correlated variables example
Consider another version of the same data, shown below, where the ratio of approved to rejected loans differs between categories “M” and “F”. Here, changing the gender changes how likely the loan is to be approved, so there is a correlation between the two variables.
    # Creating a sample data frame
    import pandas as pd
    import matplotlib.pyplot as plt

    ColumnNames = ['CIBIL', 'AGE', 'GENDER', 'SALARY', 'APPROVE_LOAN']
    DataValues = [
        [480, 28, 'M', 610000, 'Yes'],
        [480, 42, 'M', 140000, 'No'],
        [480, 29, 'M', 420000, 'No'],
        [490, 30, 'M', 420000, 'No'],
        [500, 27, 'M', 420000, 'No'],
        [510, 34, 'F', 190000, 'No'],
        [550, 24, 'M', 330000, 'Yes'],
        [560, 34, 'M', 160000, 'No'],
        [560, 25, 'F', 300000, 'Yes'],
        [570, 34, 'M', 450000, 'Yes'],
        [590, 30, 'F', 140000, 'Yes'],
        [600, 33, 'F', 600000, 'Yes'],
        [600, 22, 'M', 400000, 'No'],
        [600, 25, 'F', 490000, 'Yes'],
        [610, 32, 'F', 120000, 'Yes'],
        [630, 29, 'F', 360000, 'Yes'],
        [630, 30, 'F', 480000, 'Yes'],
        [660, 29, 'F', 460000, 'Yes'],
        [700, 32, 'M', 470000, 'Yes'],
        [740, 28, 'M', 400000, 'Yes'],
    ]

    # Create the Data Frame
    LoanData = pd.DataFrame(data=DataValues, columns=ColumnNames)
    print(LoanData.head())

    # Cross tabulation between GENDER and APPROVE_LOAN
    CrosstabResult = pd.crosstab(index=LoanData['GENDER'], columns=LoanData['APPROVE_LOAN'])
    print(CrosstabResult)

    # Grouped bar chart between GENDER and APPROVE_LOAN
    CrosstabResult.plot.bar(figsize=(7, 4), rot=0)
    plt.show()  # only needed when running outside a notebook
Sample Output

Now you can see the difference in the ratios! In this sample, loans for “F” are almost always approved (8 of 9), while for “M” the chances are roughly 50/50 (5 of 11 approved). Gender affects the approval rate, hence gender and loan approval are correlated here.
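To put a number on the difference seen in the chart, one option is to compute the approval rate per gender and the gap between the two. This is a rough sketch under assumptions: it rebuilds only the GENDER and APPROVE_LOAN columns of the correlated sample, and the names `counts`, `approval_rate` and `gap` are illustrative rather than taken from the original code.

```python
import pandas as pd

# GENDER and APPROVE_LOAN columns copied from the correlated sample data
LoanData = pd.DataFrame({
    'GENDER': list('MMMMMFMMFMFFMFFFFFMM'),
    'APPROVE_LOAN': ['Yes', 'No', 'No', 'No', 'No', 'No', 'Yes', 'No', 'Yes', 'Yes',
                     'Yes', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes'],
})

# Raw counts, the same table the grouped bar chart is drawn from
counts = pd.crosstab(index=LoanData['GENDER'], columns=LoanData['APPROVE_LOAN'])
print(counts)

# Approval rate per gender: the mean of a True/False "approved" flag
approval_rate = LoanData['APPROVE_LOAN'].eq('Yes').groupby(LoanData['GENDER']).mean()
print(approval_rate.round(2))

# Difference between the two approval rates (F minus M)
gap = approval_rate['F'] - approval_rate['M']
print(f"Approval-rate gap (F - M): {gap:.2f}")
```

A gap near zero corresponds to the no-correlation picture, while a large gap corresponds to bars with clearly different patterns. With only 20 rows this is a descriptive check, not a statistical test.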
