How to measure the correlation between a numeric and a categorical variable in Python

This situation arises in both regression and classification problems in machine learning:

  • Regression: The target variable is numeric and one of the predictors is categorical
  • Classification: The target variable is categorical and one of the predictors is numeric

In both cases, the ANOVA test can be used to check whether the two variables are related. Note that the test tells you whether the relationship is statistically significant, not how strong it is.

ANOVA stands for Analysis of Variance. The test checks whether the mean of the numeric variable differs significantly across the categories of the categorical variable. You can also visualize these differences with a box plot.
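To make the idea concrete, here is a minimal sketch of the F statistic that a one-way ANOVA is built on: the variance between the group means divided by the variance within the groups. The function name `one_way_f_statistic` and the group values are assumptions chosen purely for illustration, and in practice you would normally rely on a library routine rather than this hand-rolled version.

```python
from statistics import mean


def one_way_f_statistic(groups):
    """Return the one-way ANOVA F statistic for a list of numeric groups."""
    all_values = [x for group in groups for x in group]
    grand_mean = mean(all_values)
    k = len(groups)      # number of categories
    n = len(all_values)  # total number of observations

    # Variation of the group means around the overall mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the values around their own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within


# Illustrative values only: one list of numeric values per category
example_groups = [
    [2000, 2100, 1900, 2150, 2100, 2200, 1950],
    [2500, 2700, 2900, 2850, 2600, 2500, 2700],
    [1500, 1400, 1600, 1650, 1600, 1500, 1500],
]

f_stat = one_way_f_statistic(example_groups)
print(f"F statistic: {f_stat:.2f}")
```

A large F statistic means the group means are far apart relative to the spread inside each group, which is what leads to a small P-value.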

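Keep the following points in mind about the ANOVA hypothesis test:

  • Null hypothesis (H0): The variables are not correlated with each other, that is, the mean of the numeric variable is the same for every category
  • P-value: The probability of seeing differences between the group means at least as large as those observed, assuming the null hypothesis is true
  • Fail to reject the null hypothesis if P-value > 0.05. This means there is no significant evidence that the variables are correlated
  • Reject the null hypothesis if P-value < 0.05. This means the variables are correlated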

In the example below, we measure whether there is any correlation between FuelType and CarPrices. Here FuelType is a categorical predictor and CarPrices is the numeric target variable.
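A minimal sketch of such a test is shown here, assuming pandas and SciPy are installed. The DataFrame name `CarPricesData` and the price values are made up for illustration and are not the article's dataset. `f_oneway` is SciPy's one-way ANOVA function, and it expects each group of values as a separate argument.

```python
import pandas as pd
from scipy.stats import f_oneway

# Illustrative data only: seven car prices for each fuel type
CarPricesData = pd.DataFrame({
    "FuelType": ["Petrol"] * 7 + ["Diesel"] * 7 + ["CNG"] * 7,
    "CarPrices": [
        2000, 2100, 1900, 2150, 2100, 2200, 1950,  # Petrol
        2500, 2700, 2900, 2850, 2600, 2500, 2700,  # Diesel
        1500, 1400, 1600, 1650, 1600, 1500, 1500,  # CNG
    ],
})

# Build one list of prices per fuel type
CategoryGroupLists = CarPricesData.groupby("FuelType")["CarPrices"].apply(list)

# The * unpacks the groups so each list is passed as its own argument
AnovaResults = f_oneway(*CategoryGroupLists)
print("P-value for ANOVA is:", AnovaResults.pvalue)

# Apply the 0.05 threshold to the P-value
if AnovaResults.pvalue < 0.05:
    print("Reject H0: FuelType and CarPrices are correlated")
else:
    print("Fail to reject H0: no significant evidence of correlation")
```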

Sample Output

[Image: ANOVA test in Python]

Because the P-value is almost zero, we reject H0, which means the variables are correlated with each other.

Author Details
Lead Data Scientist
Farukh is an innovator in solving industry problems using artificial intelligence. His expertise is backed by 10 years of industry experience. As a senior data scientist, he is responsible for designing AI/ML solutions that provide maximum gains for clients. As a thought leader, he focuses on solving the key business problems of the CPG industry. He has worked across domains such as Telecom, Insurance, and Logistics, and with global tech leaders including Infosys, IBM, and Persistent Systems. His passion for teaching inspired him to create this website!

4 thoughts on “How to measure the correlation between a numeric and a categorical variable in Python”

  1. Hello Farukh, congrats on this blog, I find it very useful.
    I have a question about the p-value: how do I interpret p-value = 0.0 when evaluating the correlation between a categorical and a continuous variable?
    Thank you,
    Lorenzo

    1. The null hypothesis in the ANOVA test is
      H0: The two variables are not correlated.
      Hence, if the p-value comes out as 0, we reject H0 and say the variables are correlated with each other.

  2. Sir, can you explain what *CategoryGroupLists in the code AnovaResults = f_oneway(*CategoryGroupLists) is actually doing?

    1. Hi Apurv,

      CategoryGroupLists holds one list of continuous values for each category. The * unpacks it so that each list is passed to the f_oneway() ANOVA function as a separate argument.
      In the above scenario, it holds the lists of prices for the Petrol, Diesel and CNG categories, like this:
      [ [2000, 2100, 1900, 2150, 2100, 2200, 1950],
        [2500, 2700, 2900, 2850, 2600, 2500, 2700],
        [1500, 1400, 1600, 1650, 1600, 1500, 1500] ]
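
To illustrate what the * in f_oneway(*CategoryGroupLists) does, here is a minimal sketch in plain Python. The helper `count_arguments` is hypothetical and exists only to show how many positional arguments a function receives with and without the *.

```python
def count_arguments(*groups):
    """Hypothetical helper: report how many positional arguments arrived."""
    return len(groups)


CategoryGroupLists = [
    [2000, 2100, 1900, 2150, 2100, 2200, 1950],
    [2500, 2700, 2900, 2850, 2600, 2500, 2700],
    [1500, 1400, 1600, 1650, 1600, 1500, 1500],
]

# Without *, the whole outer list arrives as a single argument
print(count_arguments(CategoryGroupLists))   # 1

# With *, each inner list arrives as its own argument
print(count_arguments(*CategoryGroupLists))  # 3
```

The second call is equivalent to passing the three price lists one by one, which is the form a function like f_oneway() needs in order to treat each list as a separate group.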
