Stats 101: Why Is the Mean Misleading?

Business Scenario:

You have developed a dating app.
You want to run a campaign to advertise it at a location.
The average age of people living in that area is 49 years.

Will you launch the campaign?

The intuitive answer is no, simply because the ‘average’ age of the people living in that area is 49 years and it would not make sense to advertise a dating app to them.

After canceling the campaign, you looked at the age data closely and found something like this…

Most of the population is young! But there are two cases that are abruptly different from all the other ages. Maybe these are valid cases and a few of these uncles and aunties are simply not giving up! More probably, though, they are data errors.

These values are known as outliers, since they are exceptionally high or low compared to most of the values in the group.

These outliers pushed the overall average age of people living in this area up to 49. However, we can easily see that most of them are youngsters below the age of 30, exactly the audience a dating application is aimed at!
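A quick check in R makes the effect concrete. This is only a sketch: it reuses the age values from the R section later in this article, and the cutoff of 100 used to set the two extreme values aside is an arbitrary choice made for illustration.

```r
# Age values reused from the R section later in this article
age <- c(21, 22, 22, 23, 25, 27, 28, 132, 144)

# Mean of all nine values: pulled up by 132 and 144 (about 49.3)
mean_all <- mean(age)

# Mean of the remaining values once the two extreme ages are set aside.
# The cutoff of 100 is an illustrative assumption, not a general rule.
mean_typical <- mean(age[age < 100])   # 24

# Share of people younger than 30 (7 out of 9)
share_under_30 <- mean(age < 30)

print(c(mean_all = mean_all,
        mean_typical = mean_typical,
        share_under_30 = share_under_30))
```

Seven of the nine people are under 30, yet the two extreme values alone move the mean from 24 to about 49.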

So the business decision based on the mean value of age was misleading and the reason is simple.

Mean is affected by outliers

This follows from the mathematical definition of the mean: to calculate it, sum all the numbers and divide the sum by the total number of items.

For example, the mean of the first three natural numbers is (1 + 2 + 3) / 3 = 2.

If one of the values in the numerator is increased or decreased abruptly, the mean shifts towards it.

A simple scenario: if 3 becomes 300, the mean shifts from 2 to 101.

(1 + 2 + 300) / 3 = 101: the mean of the three numbers is 101 due to the outlier 300.
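The same arithmetic can be reproduced in R. The snippet below is a small sketch; the variable names are chosen only for illustration. It computes the mean by hand as sum divided by count and confirms that it matches the built-in mean().

```r
# The first three natural numbers
small <- c(1, 2, 3)
manual_mean <- sum(small) / length(small)   # definition: sum / count
stopifnot(manual_mean == mean(small))       # same as the built-in mean()

# Replace 3 with the outlier 300
with_outlier <- c(1, 2, 300)
shifted_mean <- sum(with_outlier) / length(with_outlier)

print(manual_mean)    # 2
print(shifted_mean)   # 101
```

Only one of the three values changed, but the mean moved from 2 to 101.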

And this is exactly why you cannot blindly trust the mean to approximate the typical value of anything, and why one should always question statements like the ones below.

‘Average placement salary of students from our institute is $120,000’
‘The mileage of our bike is 50 Kmpl’

These figures include outliers in the calculation and hence may misguide your decision.

Think about the scenarios above. The idea of quoting a bike's mileage or an institute's average salary is to give a rough estimate of the ‘overall’ mileage and the ‘overall’ salary of students, but outliers are included to inflate the figures, catch attention, and influence decision making.

The quoted bike mileage is an outlier in itself, since the test is run under ideal road conditions, at constant speed, and with only the rider on the bike. Under real conditions the mileage will be much lower than quoted, as we normally experience!

Does that imply Mean is useless?

Absolutely not!
The mean is important in scenarios where every value in the data must be taken into account.
For example, when finding the average sales of an organization for a year, we must include all the sales values because the figure is calculated for sales reporting purposes.
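One way to see why the mean fits reporting is that the mean multiplied by the number of periods gives back the total, while the median does not. The sketch below uses made-up monthly sales figures (hypothetical values, not from any real organization) with one exceptionally strong month.

```r
# Hypothetical monthly sales figures, invented for illustration only
monthly_sales <- c(120, 95, 110, 130, 105, 900, 115, 100, 125, 98, 102, 140)
n_months <- length(monthly_sales)

total_sales <- sum(monthly_sales)                 # 2140
from_mean   <- mean(monthly_sales) * n_months     # recovers the total
from_median <- median(monthly_sales) * n_months   # 1350, misses the big month

# all.equal() compares with a small tolerance for floating point error
print(isTRUE(all.equal(from_mean, total_sales)))
print(from_median)
```

The strong month is real revenue, so for reporting it has to count, and the mean is the measure that accounts for it.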

Understanding the outlier effect on mean values graphically using R

Creating the age dataset with outliers on the higher side

# Creating a simple vector of Age values with positive outliers

Age=c(21,22,22,23,25,27,28,132,144)

mean(Age)

Visualizing the mean getting shifted towards the right because of the outliers on the right side. This is also known as positive skew.

# Generating the density plot

plot(density(Age), xlab='Age', ylab='Age Density')

# Generating the Mean value line

abline(v=mean(Age),col='red', lwd=2)
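For comparison, the median can be drawn on the same density plot. This is a sketch that rebuilds the plot from scratch so it runs on its own; the blue dashed line, the title, and the legend are additions for illustration and are not part of the original plot.

```r
# Same Age values with high outliers as above
Age <- c(21, 22, 22, 23, 25, 27, 28, 132, 144)

# Density plot of the ages
plot(density(Age), xlab = 'Age', ylab = 'Age Density',
     main = 'Mean vs. median with high outliers')

# Mean (about 49.3): dragged towards the outliers on the right
abline(v = mean(Age), col = 'red', lwd = 2)

# Median (25): stays inside the bulk of the data
abline(v = median(Age), col = 'blue', lwd = 2, lty = 2)

legend('topright', legend = c('Mean', 'Median'),
       col = c('red', 'blue'), lwd = 2, lty = c(1, 2))
```

The red line sits well to the right of the peak, while the dashed blue line stays where most of the ages are.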

Creating the age dataset with outliers on the lower side

# Creating a simple vector of Age values with negative outliers

Age=c(-50, -30, 21,22,22,23,25,27,28,29,30)

mean(Age)

Visualizing the mean getting shifted towards the left because of the outliers on the left side. This is also known as negative skew.

# Generating the density plot

plot(density(Age), xlab='Age', ylab='Age Density')

# Generating the Mean value line 

abline(v=mean(Age),col='red', lwd=2)

Data Science Tip!
Use the median instead of the mean for missing value treatment

When preparing data for machine learning, one important step is to find and replace the missing values in each column. One of the textbook approaches is to replace the missing values with the mean of that column.
This can introduce bias in the data, since the mean does not always represent the central tendency or the general pattern of the data.
An alternative is to use the median, which is a more robust indicator of central tendency when outliers are present.
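A minimal sketch of median imputation in base R follows. The column is hypothetical: it takes the article's age values and adds two missing entries (NA) for illustration. Both mean() and median() need na.rm = TRUE to skip missing values.

```r
# Hypothetical age column: two missing values (NA) and two outliers
age <- c(21, 22, NA, 23, 25, 27, NA, 28, 132, 144)

# Candidate fill values, computed on the non-missing entries only
fill_mean   <- mean(age, na.rm = TRUE)     # 52.75, inflated by 132 and 144
fill_median <- median(age, na.rm = TRUE)   # 26, close to the typical age

# Replace the missing entries with the median, keeping the original intact
age_imputed <- age
age_imputed[is.na(age_imputed)] <- fill_median

print(age_imputed)
```

Filling with the mean would insert an age of 52.75, a value unlike anyone in the bulk of the data, whereas the median inserts 26.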

Conclusion:

The mean should not be trusted on its own as a measure of the typical value
The mean is affected by outliers in the data
Missing values are better imputed with the median than with the mean
The mean can be used where every data point needs to be taken into the calculation

Author Details

Farukh Hashmi

Lead Data Scientist

Farukh is an innovator in solving industry problems using artificial intelligence. His expertise is backed by 10 years of industry experience. As a senior data scientist, he is responsible for designing AI/ML solutions that provide maximum gains for clients. As a thought leader, his focus is on solving the key business problems of the CPG industry. He has worked across different domains like Telecom, Insurance, and Logistics. He has worked with global tech leaders including Infosys, IBM, and Persistent Systems. His passion for teaching inspired him to create this website!

https://thinkingneuron.com/

thinkingneuron@gmail.com
