Business Scenario:
You have developed a dating app. 
You want to run a campaign to advertise it at a location.
The average age of people living in that area is 49 years.
Will you launch the campaign?
The intuitive answer is no! Simply because the ‘average’ age of the people living in that area is 49 years, it would not make sense to sell them a dating app!
After canceling the campaign, you looked at the age data closely and found something unexpected.

Most of the population is young! But there are 2 cases that are abruptly different from all the other ages. Maybe these are valid cases, and a few of these uncles and aunties are simply not giving up! Or, more probably, this is a data error.
These values are known as outliers, because they are exceptionally high or low compared to most of the values in the group.
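One common convention for flagging such values is the 1.5 × IQR rule, which base R exposes through `boxplot.stats()`. The sketch below applies it to the age vector used in the R example later in this article. The rule is only one of several possible conventions, and the variable name `flagged` is illustrative.

```r
# Age values taken from the R example later in this article
Age <- c(21, 22, 22, 23, 25, 27, 28, 132, 144)

# boxplot.stats() returns, in $out, the points lying more than
# 1.5 times the interquartile range beyond the hinges
flagged <- boxplot.stats(Age)$out
print(flagged)
```

A flagged value is only a candidate outlier. Whether it is a data error or a valid case still has to be checked by hand.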
Because of these outliers, the overall average age of people living in this area rose to 49. However, we can easily see that most of them are youngsters below the age of 30, and a dating application would appeal to most of them!
So the business decision based on the mean value of age was misleading and the reason is simple.
Mean is affected by outliers
This follows from the mathematical definition of the mean: to calculate it, sum all the numbers and divide by the total number of items.
For example, the mean of the first 3 natural numbers is (1 + 2 + 3) / 3 = 2.

If one of the values in the numerator increases or decreases abruptly, the mean shifts towards it.
As a simple scenario, if 3 becomes 300, the mean shifts from 2 to (1 + 2 + 300) / 3 = 101.
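The arithmetic in this example is easy to verify in R. This is a minimal sketch. The variable names `original` and `shifted` are illustrative and do not come from the article.

```r
# The first three natural numbers, as in the example above
original <- c(1, 2, 3)

# The same values after 3 is replaced by 300
shifted <- c(1, 2, 300)

mean_original <- mean(original)  # (1 + 2 + 3) / 3
mean_shifted  <- mean(shifted)   # (1 + 2 + 300) / 3

print(mean_original)
print(mean_shifted)
```

Two of the three values are unchanged, yet the mean moves almost all the way to the single changed value.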

And this is exactly why you cannot blindly trust the mean as a summary of the typical value, and why you should always question statements like the ones below.
- ‘Average placement salary of students from our institute is $120,000’
- ‘The mileage of our bike is 50 Kmpl’
These figures take outliers into account in the calculation and hence may mislead you when making a decision.
Think about the above scenarios of bike mileage and the average salary of students from an institute. The idea should be to give a rough estimate of the ‘overall’ average mileage and the ‘overall’ average salary, but outliers can be included to inflate the figures, catch attention, and influence decision making.
The quoted bike mileage is an outlier in itself, since the test is run under ideal road conditions, at constant speed, and with only the rider on the bike. Under real conditions the mileage will be much lower than quoted, as we normally experience!
Does that imply the mean is useless?
Absolutely not!
The mean is important in scenarios where every value in the data must be included in the calculation.
For example, when finding an organization's average sales for a year, we must take all the sales values into account, because the figure is calculated for sales reporting purposes.
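One way to see why the mean suits reporting is that the mean multiplied by the number of values gives back the exact total, so no sale is left out. The sketch below uses made-up monthly figures. The `monthly_sales` vector is hypothetical and is not data from the article.

```r
# Hypothetical monthly sales for one year; the last month is unusually high
monthly_sales <- c(100, 110, 95, 105, 120, 115, 100, 90, 110, 105, 100, 650)

average_sales <- mean(monthly_sales)

# Mean times count recovers the yearly total, so the unusually
# large month is included in the reported figure, as it should be
total_from_mean <- average_sales * length(monthly_sales)

print(total_from_mean)
print(sum(monthly_sales))
```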
Understanding the outlier effect on mean values graphically using R
Creating the age dataset with outliers on the higher side
```r
# Creating a simple vector of Age values with positive outliers
Age <- c(21, 22, 22, 23, 25, 27, 28, 132, 144)
mean(Age)
```
Visualizing the mean value getting shifted towards the right because of the outliers on the right side. This is also known as positive skew.
```r
# Generating the density plot
plot(density(Age), xlab = 'Age', ylab = 'Age Density')
```
```r
# Generating the Mean value line
abline(v = mean(Age), col = 'red', lwd = 2)
```

Creating the age dataset with outliers on the lower side
```r
# Creating a simple vector of Age values with negative outliers
Age <- c(-50, -30, 21, 22, 22, 23, 25, 27, 28, 29, 30)
mean(Age)
```
Visualizing the mean value getting shifted towards the left because of the outliers on the left side. This is also known as negative skew.
```r
# Generating the density plot
plot(density(Age), xlab = 'Age', ylab = 'Age Density')
```
```r
# Generating the Mean value line
abline(v = mean(Age), col = 'red', lwd = 2)
```
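As a rough numeric check of the shift seen in the plots, the mean can be compared with the median (the middle value of the sorted data). The sketch below reuses the two age vectors from this section. The names `age_high`, `age_low`, `shift_high` and `shift_low` are introduced here for illustration. The sign of the difference is a rule of thumb, not a formal skewness measure.

```r
# The two age vectors used in this section
age_high <- c(21, 22, 22, 23, 25, 27, 28, 132, 144)          # outliers on the higher side
age_low  <- c(-50, -30, 21, 22, 22, 23, 25, 27, 28, 29, 30)  # outliers on the lower side

# Mean minus median: positive when the mean is pulled to the right,
# negative when it is pulled to the left
shift_high <- mean(age_high) - median(age_high)
shift_low  <- mean(age_low) - median(age_low)

print(shift_high)
print(shift_low)
```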

Data Science Tip!
Use the median instead of the mean for missing value treatment
While preparing data for machine learning, one important step is to find and replace the missing values in each column. One of the textbook approaches is to replace the missing values with the mean of that column.
This can introduce bias into the data when outliers are present, since the mean does not always represent the central tendency or the general pattern of the data.
An alternative is to use the median, the middle value of the sorted data, which is not pulled towards outliers and is therefore a more robust indicator of central tendency.
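A minimal sketch of median imputation in base R follows. The `age` vector with missing values is hypothetical, and the names `fill_mean`, `fill_median` and `age_imputed` are illustrative. Real pipelines may need per-column or per-group handling that is not shown here.

```r
# Hypothetical age column with two missing values and one extreme value
age <- c(21, 22, NA, 23, 25, NA, 27, 28, 144)

# na.rm = TRUE drops the NAs before computing each statistic
fill_mean   <- mean(age, na.rm = TRUE)
fill_median <- median(age, na.rm = TRUE)

# Replace the missing entries with the median
age_imputed <- age
age_imputed[is.na(age_imputed)] <- fill_median

print(fill_mean)
print(fill_median)
print(age_imputed)
```

Comparing `fill_mean` with `fill_median` shows how far the single extreme value would have pulled a mean-based fill away from the bulk of the data.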
Conclusion:
- The mean alone should not be trusted as a measure of the average trend when the data may contain outliers
- The mean is affected by outliers in the data
- Missing values should not be imputed with the mean when outliers are present; the median can be used instead
- The mean can be used where every data point needs to be included in the calculation


