The correlation value is used to measure the strength and nature of the relationship between two continuous variables while doing feature selection for machine learning.
This is commonly used in regression, where the target variable is continuous. The predictor, however, can be either continuous or categorical. Only when both variables are continuous can the correlation value be used to measure the strength of the relationship between them.
You can use a scatter plot to visualize the relationship and the correlation value to measure its strength.
Important things to remember about correlation value
- The correlation value ranges from -1 to +1
- A positive correlation lies between 0 and +1 (zero excluded)
- A negative correlation lies between -1 and 0 (zero excluded)
- A value of zero (0) means no correlation
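To make these ranges concrete, here is a minimal sketch that computes the Pearson correlation coefficient by hand. It assumes the correlation value in question is the Pearson coefficient, which is what pandas' corr() computes by default. The function name pearson_r and the small lists are hypothetical and exist only for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations from the means (unnormalized covariance)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations (unnormalized variances)
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical illustration data
hours = [1, 2, 3, 4, 5]
rising = [2 * h + 1 for h in hours]    # increases with hours
falling = [10 - 3 * h for h in hours]  # decreases as hours increase
symmetric = [1, 2, 3, 2, 1]            # rises, then falls by the same amount

r_rising = pearson_r(hours, rising)        # close to +1
r_falling = pearson_r(hours, falling)      # close to -1
r_symmetric = pearson_r(hours, symmetric)  # close to 0

print(r_rising, r_falling, r_symmetric)
```

A perfectly straight increasing line gives a value at the top of the range, a perfectly straight decreasing line gives a value at the bottom, and a pattern with no overall linear trend gives a value near zero.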
As a rule of thumb, if the magnitude of the correlation is greater than 0.5, the relationship is strong enough to build a meaningful predictive model between those two variables. However, in many datasets no predictor has a correlation magnitude above 0.5 with the target. In those cases, you will have to work with the best predictors available.
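The rule of thumb above could be sketched as a small helper. This is only an illustration under stated assumptions: pick_predictors is a hypothetical function name, the 0.5 cut-off is the rule of thumb from the text rather than a fixed standard, and the DataFrame holds the first ten rows of the sample gym data introduced later in this article.

```python
import pandas as pd

# First ten rows of the article's sample gym data
gym = pd.DataFrame(
    [[1.0, 2500, 95], [2.0, 2000, 85], [2.5, 1900, 83], [3.0, 1850, 81],
     [3.5, 1600, 80], [4.0, 1500, 78], [5.0, 1500, 77], [5.5, 1600, 80],
     [6.0, 1700, 75], [6.5, 1500, 70]],
    columns=["Hours", "Calories", "Weight"],
)

def pick_predictors(df, target, threshold=0.5):
    """Hypothetical helper: keep predictors whose |correlation| with the
    target exceeds the threshold; if none do, keep the single best one."""
    predictors = df.drop(columns=[target])
    # Correlation of every predictor column with the target, sign removed
    strength = predictors.corrwith(df[target]).abs()
    strong = strength[strength > threshold]
    if strong.empty:
        # Nothing clears the bar: work with the best available predictor
        return [strength.idxmax()]
    return list(strong.index)

selected = pick_predictors(gym, "Weight")
# An unrealistically strict threshold, used only to trigger the fallback
fallback = pick_predictors(gym, "Weight", threshold=0.99)

print(selected)
print(fallback)
```

Taking the absolute value before comparing matters, because a strong negative correlation is just as useful for prediction as a strong positive one.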
A negative correlation does not mean it's bad! It just means that the two variables move in opposite directions: when one value increases, the other tends to decrease.
A positive correlation means that the variables move in the same direction: when one value increases, the other tends to increase as well.
In the sample data below, the target variable is “Weight” and the predictors are “Hours” spent at the gym and “Calories” consumed in the day. All of these are continuous variables, hence correlation can be used to measure the strength of the relationship between them.
The correlation value can be measured using the corr() method of a pandas DataFrame in Python. By default, corr() computes the Pearson correlation coefficient.
```python
import pandas as pd

ColumnNames = ['Hours', 'Calories', 'Weight']
DataValues = [[1.0, 2500, 95],
              [2.0, 2000, 85],
              [2.5, 1900, 83],
              [3.0, 1850, 81],
              [3.5, 1600, 80],
              [4.0, 1500, 78],
              [5.0, 1500, 77],
              [5.5, 1600, 80],
              [6.0, 1700, 75],
              [6.5, 1500, 70],
              [1.3, 2200, 90],
              [2.2, 1800, 87],
              [3.2, 1750, 81],
              [3.7, 1600, 80],
              [4.2, 1550, 75],
              [5.1, 1500, 79],
              [5.8, 1650, 82],
              [6.3, 1700, 72],
              [6.5, 1400, 69]]

# Create the Data Frame
GymData = pd.DataFrame(data=DataValues, columns=ColumnNames)
print(GymData.head())

########################################################
# Measuring correlation between two variables
# (print() makes the result visible outside a notebook as well)
print(GymData[['Calories', 'Weight']].corr())

# Visualizing correlation between two variables using scatter plot
GymData.plot.scatter(x='Calories', y='Weight', marker='o', figsize=(7, 5))

########################################################
# Measuring correlation between two variables
print(GymData[['Hours', 'Weight']].corr())

# Visualizing correlation between two variables using scatter plot
GymData.plot.scatter(x='Hours', y='Weight', marker='o', figsize=(7, 5))
```
Positive Correlation Sample Output

Negative Correlation Sample Output

Measuring correlation for all the variables at once
```python
# Measuring correlation for all variables
GymData.corr()
```
Sample Output:

This is a correlation matrix. The values represent the correlation of every variable with every other variable. You only need to focus on the row or column of the target variable. In the matrix above, the target is Weight, so the last row (or, equivalently, the last column, since the matrix is symmetric) shows the correlation of Hours and Calories with Weight.
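Focusing on the target's column can also be done in code. The following is a minimal sketch, assuming a DataFrame built from six rows of the sample data above; the variable names gym, target_corr and ranked are hypothetical.

```python
import pandas as pd

# Six rows taken from the article's sample gym data
gym = pd.DataFrame({
    "Hours": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "Calories": [2500, 2000, 1850, 1500, 1500, 1700],
    "Weight": [95, 85, 81, 78, 77, 75],
})

corr_matrix = gym.corr()

# Keep only the target's column and drop its correlation with itself
target_corr = corr_matrix["Weight"].drop("Weight")

# Rank predictors by strength (absolute value) while keeping the sign
order = target_corr.abs().sort_values(ascending=False).index
ranked = target_corr.reindex(order)

print(ranked)
```

Ranking by absolute value while keeping the sign shows at a glance both how strong each relationship is and in which direction it goes.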