How to measure the correlation between two numeric variables in Python

The correlation value is used to measure the strength and nature of the relationship between two continuous variables while doing feature selection for machine learning.

Correlation is commonly used in regression problems, where the target variable is continuous and a predictor can be either continuous or categorical. When both the predictor and the target are continuous, the correlation value measures the strength of the relationship between them.

Use a scatter plot to visualize the relationship, and the correlation value to quantify its strength.

Important things to remember about correlation value

The correlation value ranges from -1 to +1
A positive correlation lies between 0 and +1 (zero excluded)
A negative correlation lies between -1 and 0 (zero excluded)
A value of exactly zero (0) means there is no linear correlation

As a rule of thumb, a correlation magnitude above 0.5 suggests the relationship is strong enough for the predictor to be useful in a predictive model. In many datasets, however, no predictor has a correlation magnitude above 0.5 with the target. In those cases, you have to work with the best predictors available.
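As a rough illustration of this rule of thumb, the sketch below ranks predictors by the magnitude of their correlation with the target and flags those above 0.5. The helper name rank_predictors, its threshold argument, and the six-row subset of the gym sample data used later in this article are assumptions made for the example. DataFrame.corrwith is a standard pandas method that uses Pearson correlation by default.

```python
import pandas as pd

# Six rows taken from the article's gym sample data (a subset, for brevity)
gym = pd.DataFrame({
    "Hours": [1.0, 2.0, 3.0, 4.0, 5.0, 6.5],
    "Calories": [2500, 2000, 1850, 1500, 1500, 1500],
    "Weight": [95, 85, 81, 78, 77, 70],
})


def rank_predictors(df, target, threshold=0.5):
    """Hypothetical helper: rank predictors by |correlation| with the target."""
    predictors = df.drop(columns=[target])
    # Correlation of every predictor column with the target column
    corr = predictors.corrwith(df[target])
    summary = pd.DataFrame({
        "correlation": corr,
        "magnitude": corr.abs(),
    })
    # The 0.5 cut-off is the rule of thumb from the text, not a statistical test
    summary["above_threshold"] = summary["magnitude"] > threshold
    return summary.sort_values("magnitude", ascending=False)


ranking = rank_predictors(gym, target="Weight")
print(ranking)
```

When no predictor clears the threshold, the same table still shows which predictors are the best available.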

A negative correlation is not a bad thing! It simply means that the two variables tend to move in opposite directions: when one increases, the other tends to decrease.

A positive correlation means that the two variables tend to move in the same direction: when one increases, the other tends to increase as well.
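The three cases above (positive, negative, and zero) can be reproduced with a few made-up numbers. This is only a sketch: the arrays and the pearson_r helper are invented for illustration, and numpy.corrcoef is used to compute the Pearson coefficient. The last case also shows that a value of zero rules out a linear relationship only, since the third series clearly depends on x.

```python
import numpy as np

# Toy values invented for illustration (not from the gym data)
x = np.array([1, 2, 3, 4, 5])
rises_with_x = np.array([2, 4, 6, 8, 10])   # increases whenever x increases
falls_with_x = np.array([10, 8, 6, 4, 2])   # decreases whenever x increases
rises_then_falls = np.array([1, 2, 3, 2, 1])  # symmetric around the middle


def pearson_r(a, b):
    """Hypothetical helper: Pearson correlation between two 1-D arrays."""
    # np.corrcoef returns a 2x2 matrix; the off-diagonal entry is r(a, b)
    return float(np.corrcoef(a, b)[0, 1])


r_pos = pearson_r(x, rises_with_x)
r_neg = pearson_r(x, falls_with_x)
r_zero = pearson_r(x, rises_then_falls)

print("positive:", round(r_pos, 3))
print("negative:", round(r_neg, 3))
print("zero:    ", round(r_zero, 3))
```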

In the sample data below, the target variable is “Weight” and the predictors are “Hours” spent at the gym and “Calories” consumed in the day. All three are continuous variables, so correlation can be used to measure the strength of the relationship between them.

The correlation value can be computed with the corr() method of a pandas DataFrame in Python, which returns the Pearson correlation coefficient by default.
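To show what the number represents, the sketch below computes the Pearson coefficient by hand, as the sum of the products of the deviations from the mean divided by the square root of the product of the summed squared deviations, and checks it against pandas' Series.corr, whose default method is Pearson. The function name pearson_manual and the six-row subset of the sample data are assumptions made for the example.

```python
import math

import pandas as pd

# Six (Hours, Weight) pairs taken from the article's gym sample data
hours = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.5])
weight = pd.Series([95, 85, 81, 78, 77, 70])


def pearson_manual(x, y):
    """Hypothetical helper: Pearson's r computed from its definition."""
    dx = x - x.mean()  # deviations of x from its mean
    dy = y - y.mean()  # deviations of y from its mean
    return (dx * dy).sum() / math.sqrt((dx ** 2).sum() * (dy ** 2).sum())


r_manual = pearson_manual(hours, weight)
r_pandas = hours.corr(weight)  # pandas default: method='pearson'

print("manual:", round(r_manual, 4))
print("pandas:", round(r_pandas, 4))
```

Both values should agree up to floating-point rounding, and the sign is negative because Weight falls as Hours rises in these rows.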

import pandas as pd
ColumnNames=['Hours','Calories', 'Weight']
DataValues=[[  1.0,   2500,   95],
             [  2.0,   2000,   85],
             [  2.5,   1900,   83],
             [  3.0,   1850,   81],
             [  3.5,   1600,   80],
             [  4.0,   1500,   78],
             [  5.0,   1500,   77],
             [  5.5,   1600,   80],
             [  6.0,   1700,   75],
             [  6.5,   1500,   70],
             [  1.3,   2200,   90],
             [  2.2,   1800,   87],
             [  3.2,   1750,   81],
             [  3.7,   1600,   80],
             [  4.2,   1550,   75],
             [  5.1,   1500,   79],
             [  5.8,   1650,   82],
             [  6.3,   1700,   72],
             [  6.5,   1400,   69]]
#Create the Data Frame
GymData=pd.DataFrame(data=DataValues,columns=ColumnNames)
print(GymData.head())
########################################################
# Measuring correlation between Calories and Weight
# (print is needed so the result shows when this runs as a script or a single cell)
print(GymData[['Calories','Weight']].corr())

# Visualizing the relationship between Calories and Weight using a scatter plot
GymData.plot.scatter(x='Calories', y='Weight', marker='o', figsize=(7,5))

########################################################
# Measuring correlation between Hours and Weight
print(GymData[['Hours','Weight']].corr())

# Visualizing the relationship between Hours and Weight using a scatter plot
GymData.plot.scatter(x='Hours', y='Weight', marker='o', figsize=(7,5))

Positive Correlation Sample Output

[Figure: correlation value and scatter plot for Calories vs. Weight, showing a positive correlation]

Negative Correlation Sample Output

[Figure: correlation value and scatter plot for Hours vs. Weight, showing a negative correlation]

Measuring correlation for all the variables at once

# Measuring correlation for all variables
GymData.corr()

Sample Output:

[Figure: correlation matrix for all variables in the pandas data frame]

This is a correlation matrix: each cell holds the correlation between the variable in its row and the variable in its column. For feature selection, you only need the row or the column of the target variable. Here the target is Weight, so the last row shows the correlation of Hours and Calories with Weight.
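Reading off the target's row or column can also be done in code. The sketch below is one possible way to do it, assuming a six-row subset of the gym sample data and variable names chosen for the example. It also checks two properties of any Pearson correlation matrix: it is symmetric, and each variable has a correlation of 1 with itself.

```python
import pandas as pd

# Six rows taken from the article's gym sample data (a subset, for brevity)
gym = pd.DataFrame({
    "Hours": [1.0, 2.0, 3.0, 4.0, 5.0, 6.5],
    "Calories": [2500, 2000, 1850, 1500, 1500, 1500],
    "Weight": [95, 85, 81, 78, 77, 70],
})

corr_matrix = gym.corr()

# The matrix is symmetric, so the Weight row and the Weight column match
weight_row = corr_matrix.loc["Weight"]
weight_col = corr_matrix["Weight"]

# Drop the target's correlation with itself (always 1) to keep predictors only
predictors_only = weight_col.drop("Weight")

print(corr_matrix)
print(predictors_only)
```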

Author Details

Farukh Hashmi

Lead Data Scientist

Farukh is an innovator in solving industry problems using artificial intelligence. His expertise is backed by 10 years of industry experience. As a senior data scientist, he is responsible for designing AI/ML solutions that provide maximum gains for clients. As a thought leader, his focus is on solving the key business problems of the CPG industry. He has worked across different domains like Telecom, Insurance, and Logistics. He has worked with global tech leaders including Infosys, IBM, and Persistent Systems. His passion for teaching inspired him to create this website!

https://thinkingneuron.com/

thinkingneuron@gmail.com
