How to measure the correlation between a numeric and a categorical variable in Python

This situation comes up in both regression and classification problems in machine learning.

Regression: The target variable is numeric and one of the predictors is categorical
Classification: The target variable is categorical and one of the predictors is numeric

In both cases, a one-way ANOVA test can be used to check whether the two variables are related.

ANOVA stands for Analysis of Variance. The test checks whether the mean of the numeric variable differs significantly across the groups defined by the categorical variable. You can also see these differences visually with a box plot.
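
As a rough illustration of what a box plot would show, the sketch below summarises a numeric column per category with pandas. The tiny data set and the names ToyData, Group and Value are made up for this illustration and are not part of the example used later in this post.

```python
import pandas as pd

# Hypothetical toy data: one categorical column and one numeric column
ToyData = pd.DataFrame({
    'Group': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Value': [10, 12, 11, 20, 22, 21],
})

# Min, quartiles and max per category: the numbers a box plot draws
GroupSummary = ToyData.groupby('Group')['Value'].describe()
print(GroupSummary[['min', '25%', '50%', '75%', 'max']])

# The group means are what ANOVA compares
GroupMeans = ToyData.groupby('Group')['Value'].mean()
print(GroupMeans)
```

If matplotlib is installed, ToyData.boxplot(column='Value', by='Group') should draw the corresponding box plot.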

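Keep the following points in mind about the ANOVA hypothesis test:

Null hypothesis (H0): The variables are not correlated with each other, i.e. the group means are all equal
P-value: The probability of seeing differences between the group means at least as large as those observed, assuming the Null hypothesis is true
Fail to reject the Null hypothesis if P-value > 0.05. The data does not show that the variables are correlated
Reject the Null hypothesis if P-value < 0.05. The variables are correlated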
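
The decision rule above can be sketched as a small helper. The function name interpret_p_value and its alpha parameter are hypothetical, introduced only for this illustration; 0.05 is the threshold used in this post.

```python
def interpret_p_value(p_value, alpha=0.05):
    """Return the test decision for a given P-value and threshold alpha."""
    if p_value < alpha:
        # Evidence against H0: the group means differ
        return 'Reject H0'
    # Not enough evidence against H0
    return 'Fail to reject H0'


print(interpret_p_value(0.001))
print(interpret_p_value(0.2))
```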

In the example below, we measure whether there is any correlation between FuelType and CarPrice. Here FuelType is the categorical predictor and CarPrice is the numeric target variable.

# Generating sample data
import pandas as pd
ColumnNames=['FuelType','CarPrice']
DataValues= [[  'Petrol',   2000],
             [  'Petrol',   2100],
             [  'Petrol',   1900],
             [  'Petrol',   2150],
             [  'Petrol',   2100],
             [  'Petrol',   2200],
             [  'Petrol',   1950],
             [  'Diesel',   2500],
             [  'Diesel',   2700],
             [  'Diesel',   2900],
             [  'Diesel',   2850],
             [  'Diesel',   2600],
             [  'Diesel',   2500],
             [  'Diesel',   2700],
             [  'CNG',   1500],
             [  'CNG',   1400],
             [  'CNG',   1600],
             [  'CNG',   1650],
             [  'CNG',   1600],
             [  'CNG',   1500],
             [  'CNG',   1500]]
#Create the Data Frame
CarData=pd.DataFrame(data=DataValues,columns=ColumnNames)
print(CarData.head())
########################################################
# f_oneway() function takes the group data as input and 
# returns F-statistic and P-value
from scipy.stats import f_oneway

# Running the one-way ANOVA test between CarPrice and FuelType
# Assumption(H0) is that FuelType and CarPrice are NOT correlated

# Collect the CarPrice values for each FuelType as a list
CategoryGroupLists=CarData.groupby('FuelType')['CarPrice'].apply(list)

# Performing the ANOVA test
# We fail to reject the Assumption(H0) when P-Value > 0.05
AnovaResults = f_oneway(*CategoryGroupLists)
print('P-Value for Anova is: ', AnovaResults[1])

Sample Output

Since the P-value is almost zero, we reject H0. This means the mean CarPrice differs significantly between the FuelType groups, so the two variables are correlated with each other.
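
To make the test less of a black box, here is a minimal sketch that recomputes the F-statistic by hand for the same prices and compares it with f_oneway(). The variable names are chosen for this illustration. The eta squared line is an addition that is not part of the original example: it is a common way to express how much of the total variation the grouping explains, which the P-value alone does not tell you.

```python
import numpy as np
from scipy.stats import f_oneway

# Same prices as in the example above, grouped by FuelType
PriceGroups = {
    'Petrol': [2000, 2100, 1900, 2150, 2100, 2200, 1950],
    'Diesel': [2500, 2700, 2900, 2850, 2600, 2500, 2700],
    'CNG':    [1500, 1400, 1600, 1650, 1600, 1500, 1500],
}
Groups = [np.array(values, dtype=float) for values in PriceGroups.values()]
AllValues = np.concatenate(Groups)
GrandMean = AllValues.mean()

# Variation of the group means around the overall mean
SSBetween = sum(len(g) * (g.mean() - GrandMean) ** 2 for g in Groups)
# Variation of the values around their own group mean
SSWithin = sum(((g - g.mean()) ** 2).sum() for g in Groups)

DfBetween = len(Groups) - 1
DfWithin = len(AllValues) - len(Groups)

# F-statistic: between-group variance divided by within-group variance
FManual = (SSBetween / DfBetween) / (SSWithin / DfWithin)
FScipy, PValue = f_oneway(*Groups)

# Eta squared: share of the total variation explained by the grouping
EtaSquared = SSBetween / (SSBetween + SSWithin)

print('Manual F   :', FManual)
print('SciPy F    :', FScipy)
print('P-value    :', PValue)
print('Eta squared:', EtaSquared)
```

The two F values should agree up to floating point error. A large F means the group means are far apart relative to the spread inside each group, which is what drives the P-value towards zero.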

Author Details

Farukh Hashmi

Lead Data Scientist

Farukh is an innovator in solving industry problems using Artificial intelligence. His expertise is backed with 10 years of industry experience. Being a senior data scientist he is responsible for designing the AI/ML solution to provide maximum gains for the clients. As a thought leader, his focus is on solving the key business problems of the CPG Industry. He has worked across different domains like Telecom, Insurance, and Logistics. He has worked with global tech leaders including Infosys, IBM, and Persistent systems. His passion to teach inspired him to create this website!

https://thinkingneuron.com/

thinkingneuron@gmail.com

6 thoughts on “How to measure the correlation between a numeric and a categorical variable in Python”

Lorenzo
April 9, 2021 at 1:04 pm

Hello Farukh, congrats on this blog, I find it very useful.
I have a question about p-value… how do I interpret p-value = 0.0 in the evaluation of categorical vs continuous correlation?
Thank you,
Lorenzo

1. Farukh Hashmi
  April 16, 2021 at 4:41 am
  
  The Null hypothesis in the ANOVA test is
  H0: The two variables are not correlated.
  Hence, if the p-value comes out as 0, we reject H0 and say the variables are correlated with each other.
  
Apurv
August 26, 2021 at 8:07 pm

Sir, can you explain what *CategoryGroupLists in the AnovaResults = f_oneway(*CategoryGroupLists) code is actually doing?

1. Farukh Hashmi
  August 30, 2021 at 5:26 pm
  
  Hi Apurv,
  
  CategoryGroupLists holds one list of continuous values for each category, and the * unpacks it so that each list is passed to the f_oneway() ANOVA function as a separate argument.
  In the above scenario, it holds the list of prices for each of the Petrol, Diesel and CNG categories,
  like this [ [2000, 2100, 1900, 2150, 2100, 2200, 1950] ,
  [2500, 2700, 2900, 2850, 2600, 2500, 2700],
  [1500, 1400, 1600, 1650, 1600, 1500, 1500] ]
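
To illustrate the reply above, here is a small sketch of what the * does, using a shortened, made-up version of the car data. It assumes only that groupby() returns the categories in sorted order, which is pandas' default behaviour, so the lists come out as Diesel first and Petrol second.

```python
import pandas as pd
from scipy.stats import f_oneway

# Shortened, hypothetical version of the car data
CarData = pd.DataFrame({
    'FuelType': ['Petrol', 'Petrol', 'Petrol', 'Diesel', 'Diesel', 'Diesel'],
    'CarPrice': [2000, 2100, 1900, 2500, 2700, 2900],
})

# One list of prices per FuelType, stored in a pandas Series
CategoryGroupLists = CarData.groupby('FuelType')['CarPrice'].apply(list)
print(CategoryGroupLists.index.tolist())
print(CategoryGroupLists.tolist())

# Unpacking with * passes each list as its own argument ...
Unpacked = f_oneway(*CategoryGroupLists)
# ... which is the same as writing the arguments out by hand
Explicit = f_oneway(CategoryGroupLists['Diesel'], CategoryGroupLists['Petrol'])

print(Unpacked[1] == Explicit[1])
```

Without the *, f_oneway() would receive the whole Series as a single group instead of one group per FuelType.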
  
Naf
December 19, 2022 at 8:21 am

Hi Farukh, what an excellent video regrouping the different terminologies for hypothesis testing, be it for machine learning or statistics, thank you very much!

1. Farukh Hashmi
  December 19, 2022 at 9:21 am
  
  Hi Naf!
  
  Thank you for the kind words!
  I am happy it was useful. Cheers!
