This scenario can happen when you are doing regression or classification in machine learning.
- Regression: The target variable is numeric and one of the predictors is categorical
- Classification: The target variable is categorical and one of the predictors is numeric
In both cases, the ANOVA test can be used to check whether the two variables are correlated.
ANOVA stands for Analysis of Variance. The test checks whether the mean of the numeric variable differs significantly across the values of the categorical variable. You can also see this visually with a box plot.
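To make the idea concrete, here is a minimal sketch of the quantity a one-way ANOVA is built on: the F-statistic, which is the ratio of the variation between the group means to the variation within the groups. The function name `one_way_f_statistic` is made up for illustration, and the prices are the same sample values used in the example in this post. In practice a library routine would compute this (and the P-value) for you.

```python
from statistics import mean


def one_way_f_statistic(groups):
    """Return the one-way ANOVA F-statistic for a list of numeric groups.

    Hypothetical helper for illustration only.
    """
    all_values = [value for group in groups for value in group]
    grand_mean = mean(all_values)
    group_means = [mean(group) for group in groups]

    k = len(groups)      # number of categories
    n = len(all_values)  # total number of observations

    # Variation of the group means around the overall mean
    ss_between = sum(
        len(group) * (group_mean - grand_mean) ** 2
        for group, group_mean in zip(groups, group_means)
    )
    # Variation of the individual values around their own group mean
    ss_within = sum(
        (value - group_mean) ** 2
        for group, group_mean in zip(groups, group_means)
        for value in group
    )

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within


petrol = [2000, 2100, 1900, 2150, 2100, 2200, 1950]
diesel = [2500, 2700, 2900, 2850, 2600, 2500, 2700]
cng = [1500, 1400, 1600, 1650, 1600, 1500, 1500]

f_stat = one_way_f_statistic([petrol, diesel, cng])
print('F-statistic:', f_stat)
```

A large F-statistic means the group means are far apart compared to the spread inside each group, which is what leads to a small P-value.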
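Keep the following points in mind about the ANOVA hypothesis test:
- Null hypothesis (H0): The variables are not correlated with each other, i.e. the mean of the numeric variable is the same for every category
- P-value: The probability of seeing differences between the group means at least as large as those observed, assuming the null hypothesis is true
- If P-value > 0.05, we fail to reject the null hypothesis (often loosely called "accepting" H0). The data does not show that the variables are correlated
- If P-value < 0.05, we reject the null hypothesis. This means the variables are correlated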
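The decision rule above can be sketched as a small helper. The function name `interpret_anova_p_value` and the wording of the returned messages are my own, and 0.05 is just the conventional threshold used in this post, so treat it as an illustration rather than a fixed recipe.

```python
def interpret_anova_p_value(p_value, alpha=0.05):
    """Turn an ANOVA P-value into a decision about H0 (hypothetical helper)."""
    if p_value < alpha:
        return 'Reject H0: the variables are correlated'
    return 'Fail to reject H0: no evidence that the variables are correlated'


print(interpret_anova_p_value(0.0004))  # Reject H0: the variables are correlated
print(interpret_anova_p_value(0.37))    # Fail to reject H0: no evidence that the variables are correlated
```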
In the example below, we measure whether there is any correlation between FuelType and CarPrice. Here FuelType is the categorical predictor and CarPrice is the numeric target variable.
    # Generating sample data
    import pandas as pd
    ColumnNames=['FuelType','CarPrice']
    DataValues= [[ 'Petrol', 2000],
                 [ 'Petrol', 2100],
                 [ 'Petrol', 1900],
                 [ 'Petrol', 2150],
                 [ 'Petrol', 2100],
                 [ 'Petrol', 2200],
                 [ 'Petrol', 1950],
                 [ 'Diesel', 2500],
                 [ 'Diesel', 2700],
                 [ 'Diesel', 2900],
                 [ 'Diesel', 2850],
                 [ 'Diesel', 2600],
                 [ 'Diesel', 2500],
                 [ 'Diesel', 2700],
                 [ 'CNG', 1500],
                 [ 'CNG', 1400],
                 [ 'CNG', 1600],
                 [ 'CNG', 1650],
                 [ 'CNG', 1600],
                 [ 'CNG', 1500],
                 [ 'CNG', 1500] ]

    #Create the Data Frame
    CarData=pd.DataFrame(data=DataValues,columns=ColumnNames)
    print(CarData.head())

    ########################################################
    # f_oneway() function takes the group data as input and
    # returns F-statistic and P-value
    from scipy.stats import f_oneway

    # Running the one-way anova test between CarPrice and FuelTypes
    # Assumption(H0) is that FuelType and CarPrices are NOT correlated

    # Finds out the Prices data for each FuelType as a list
    CategoryGroupLists=CarData.groupby('FuelType')['CarPrice'].apply(list)

    # Performing the ANOVA test
    # We accept the Assumption(H0) only when P-Value > 0.05
    AnovaResults = f_oneway(*CategoryGroupLists)
    print('P-Value for Anova is: ', AnovaResults[1])
Sample Output

Since the P-value is almost zero, we reject H0, which means the variables are correlated with each other.
Hello Farukh, congrats on this blog, I find it very useful.
I have a question about p-value… how do I interpret p-value = 0.0 in the evaluation of categorical vs continuous correlation?
Thank you,
Lorenzo
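The null hypothesis in the ANOVA test is
H0: The two variables are not correlated.
Hence, if the p-value comes out as 0, we reject H0 and say the variables are correlated with each other. A p-value shown as 0.0 means the true value is too small to be displayed or represented, not that it is literally zero.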
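A quick way to see why a very small p-value can show up as 0.0 is to print the same number in different formats. The value below is an assumed one, picked only for illustration; it is not the output of any test in this post.

```python
# Assumed tiny p-value, purely for illustration
tiny_p_value = 3.2e-15

rounded = round(tiny_p_value, 6)      # rounding hides the value
fixed = f'{tiny_p_value:.6f}'         # fixed-point formatting hides it too
scientific = f'{tiny_p_value:.3e}'    # scientific notation keeps it visible

print(rounded)     # 0.0
print(fixed)       # 0.000000
print(scientific)  # 3.200e-15
```

So when a p-value prints as 0.0, formatting it in scientific notation is a simple way to check how small it really is.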
Sir, can you explain what *CategoryGroupLists in the code AnovaResults = f_oneway(*CategoryGroupLists) is actually doing?
Hi Apurv,
CategoryGroupLists holds one list of prices for each category, and the * unpacks it so that each list is passed to the f_oneway() ANOVA function as a separate argument.
In the above scenario, the function receives the lists of prices for the CNG, Diesel and Petrol categories (groupby sorts the categories alphabetically),
like this: f_oneway([1500, 1400, 1600, 1650, 1600, 1500, 1500],
[2500, 2700, 2900, 2850, 2600, 2500, 2700],
[2000, 2100, 1900, 2150, 2100, 2200, 1950])
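If the * is still unclear, this small sketch shows the difference it makes in plain Python. The function `count_groups` is a made-up stand-in for any function that, like f_oneway(), accepts each group as a separate argument.

```python
def count_groups(*groups):
    """Return how many separate positional arguments were received."""
    return len(groups)


# One list of prices per fuel type, same sample values as in this post
price_lists = [
    [1500, 1400, 1600, 1650, 1600, 1500, 1500],
    [2500, 2700, 2900, 2850, 2600, 2500, 2700],
    [2000, 2100, 1900, 2150, 2100, 2200, 1950],
]

without_star = count_groups(price_lists)   # one argument: the outer list itself
with_star = count_groups(*price_lists)     # three arguments: one list per category

print(without_star, with_star)  # 1 3
```

Without the *, the function would see a single argument containing all three lists, so there would be no separate groups to compare.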