A scatter plot is the chart to use when you want to visualize the relationship between two continuous variables in data. It is typically used in supervised ML (regression), where the target variable is continuous. If you want to check which continuous predictors have a clear relationship with the target variable, look at their scatter plots.
Consider the scenario below. Here the target variable is “Weight”, and we are trying to predict it from the number of hours a person works out at the gym and the number of calories they consume in a day.
If you plot a scatter chart of weight against calories, you can see an increasing trend. From this graph we can deduce that as calorie intake increases, weight also increases. This is known as a positive correlation. We can see a “clear trend”, hence there is a relationship between weight and calories. In other words, the predictor variable calories can be used to predict weight.
Similarly, there is a clear decreasing trend between weight and hours: as the number of hours at the gym increases, weight decreases. This is known as a negative correlation. Again, there is a “clear trend”, hence there is a relationship between weight and hours. In other words, hours can be used to predict weight.
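The direction of each trend can also be checked with a number instead of by eye. The sketch below is an illustration and not part of the original walkthrough: it assumes pandas is installed, reuses the sample values from this scenario, and computes the Pearson correlation coefficient with `Series.corr`. The names `gym`, `r_calories` and `r_hours` are chosen here for illustration only.

```python
import pandas as pd

# Sample values from the gym scenario: [Hours, Calories, Weight]
rows = [
    [1.0, 2500, 95], [2.0, 2000, 85], [2.5, 1900, 83], [3.0, 1850, 81],
    [3.5, 1600, 80], [4.0, 1500, 78], [5.0, 1500, 77], [5.5, 1600, 80],
    [6.0, 1700, 75], [6.5, 1500, 70], [1.3, 2200, 90], [2.2, 1800, 87],
    [3.2, 1750, 81], [3.7, 1600, 80], [4.2, 1550, 75], [5.1, 1500, 79],
    [5.8, 1650, 82], [6.3, 1700, 72], [6.5, 1400, 69],
]
gym = pd.DataFrame(rows, columns=['Hours', 'Calories', 'Weight'])

# Pearson correlation coefficient (the default method of Series.corr):
# close to +1 for an increasing linear trend, close to -1 for a
# decreasing one, and close to 0 when there is no linear trend.
r_calories = gym['Calories'].corr(gym['Weight'])
r_hours = gym['Hours'].corr(gym['Weight'])

print(f"Calories vs Weight: {r_calories:.2f}")
print(f"Hours vs Weight:    {r_hours:.2f}")
```

A positive value for calories and a negative value for hours would agree with the trends described above. The coefficient only measures linear association, so it complements the scatter plot and does not replace it.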
```python
import pandas as pd

ColumnNames=['Hours','Calories', 'Weight']
DataValues=[[ 1.0, 2500, 95],
            [ 2.0, 2000, 85],
            [ 2.5, 1900, 83],
            [ 3.0, 1850, 81],
            [ 3.5, 1600, 80],
            [ 4.0, 1500, 78],
            [ 5.0, 1500, 77],
            [ 5.5, 1600, 80],
            [ 6.0, 1700, 75],
            [ 6.5, 1500, 70],
            [ 1.3, 2200, 90],
            [ 2.2, 1800, 87],
            [ 3.2, 1750, 81],
            [ 3.7, 1600, 80],
            [ 4.2, 1550, 75],
            [ 5.1, 1500, 79],
            [ 5.8, 1650, 82],
            [ 6.3, 1700, 72],
            [ 6.5, 1400, 69] ]

# Create the Data Frame
GymData=pd.DataFrame(data=DataValues,columns=ColumnNames)
print(GymData.head())

########################################################
# Jupyter notebook magic to render plots inline; omit it in a plain script
%matplotlib inline

# Positive correlation scatter plot
GymData.plot.scatter(x='Calories', y='Weight', marker='o', figsize=(7,5))

# Negative correlation scatter plot
GymData.plot.scatter(x='Hours', y='Weight', marker='o', figsize=(7,5))
```
Sample Output:


What if there is no clear trend in the scatter plot?
If you cannot see any trend (increasing or decreasing) in the scatter plot, the variables are not correlated with each other. Hence, it will not be possible to build a useful model from those two variables alone.
For example, look at the scatter plot below of diamond prices against their depth. You cannot see a clear increasing or decreasing trend, hence no useful model can be created between depth and price. In other words, depth cannot be used to predict price.
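The "no clear trend" case can be sketched numerically as well. The example below does not use the diamonds data: it assumes NumPy is installed and generates two hypothetical, independently drawn columns (`predictor` and `target` are illustrative names) to show what a correlation coefficient looks like when no relationship exists.

```python
import numpy as np

# Fixed seed so the synthetic example is reproducible
rng = np.random.default_rng(seed=0)

# Hypothetical data: the two columns are drawn independently of each
# other, so no increasing or decreasing trend is expected between them.
predictor = rng.normal(loc=0.0, scale=1.0, size=1000)
target = rng.uniform(low=0.0, high=100.0, size=1000)

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal
# entry is the Pearson coefficient between the two inputs.
r = np.corrcoef(predictor, target)[0, 1]
print(f"predictor vs target (synthetic): {r:.3f}")
```

A coefficient close to 0 matches a scatter plot that looks like a shapeless cloud. A value near 0 only rules out a linear trend, so it is still worth looking at the plot for curved patterns.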
