When you want to understand the central tendency of a numeric dataset, the median is often the better measure, for example when the question is one of the following:
1. What is the average age of people living in an area?
2. What is the average placement salary of students from an institute?
A better average to use here is the median. The mean takes outliers into account, so a few extreme values can pull it away from the typical value and mislead you into wrong decisions. The median performs better because of the way it is calculated.
Median divides the data into two equal halves
Let us understand how the median is calculated.
Step-1: Arrange the data in increasing order
Step-2: Find the value which divides the data into two equal parts. If the number of values is odd, this is the middle value.
Step-3: If the number of values is even, take the mean of the two values in the middle.
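The three steps above can be sketched as a small R function. This is only an illustration of the procedure, not how R's built-in median() is implemented; the name manual_median and the sample vectors are assumptions made for this example.

```r
# Hypothetical helper that follows the three steps literally.
manual_median <- function(x) {
  sorted <- sort(x)              # Step 1: arrange the data in increasing order
  n <- length(sorted)
  if (n %% 2 == 1) {
    # Step 2 (odd count): the single middle value splits the data in half
    sorted[(n + 1) / 2]
  } else {
    # Step 3 (even count): take the mean of the two middle values
    mean(sorted[c(n / 2, n / 2 + 1)])
  }
}

manual_median(c(7, 3, 11, 1, 9))   # 7
manual_median(c(4, 1, 3, 2))       # 2.5
```

For everyday work the built-in median() does the same job; the helper is only meant to make the steps visible.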
Understanding the calculation of median in R
Example-1 (Odd number of values): The median of the numbers 1:11 is 6; there are 5 values on the left side and 5 values on the right side.
    # Odd number of values
    DataPoints <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
    median(DataPoints)
Example-2 (Even number of values): The median of the numbers 1:10 is 5.5, calculated as (5+6)/2. Once again there are 5 values on the left side and 5 values on the right side.
    # Even number of values
    DataPoints <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
    median(DataPoints)
Median is not affected by outliers
Because of the way the median is calculated, it is not affected by outliers in the data.
For example, consider the data points below, which include one outlier.
    1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100
The median is 6, which divides the whole dataset into two equal halves: five values are on the left side and five values are on the right side.
If we calculate the mean of the same dataset, it comes to about 14.09 because of the outlier 100.
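As a quick check of the numbers above, the following sketch computes both measures in R, first with the outlier and then without it. The variable names WithOutlier and WithoutOutlier are chosen for this illustration only.

```r
# The same data points as above, including the outlier 100
WithOutlier <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100)
median(WithOutlier)              # 6
round(mean(WithOutlier), 2)      # 14.09

# Dropping the outlier barely moves the median but changes the mean a lot
WithoutOutlier <- WithOutlier[WithOutlier != 100]
median(WithoutOutlier)           # 5.5
mean(WithoutOutlier)             # 5.5
```

A single extreme value moves the mean from 5.5 to about 14.09, while the median only moves from 5.5 to 6.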
Business Scenario:
You have developed a dating app
You want to run a campaign to advertise it at a location
The average age of people living in that area is 49
Will you launch the app?
Let us dig a little deeper by computing both the mean and the median.
Mean Age: 49 Years
Median Age: 25 Years
If you look only at the mean, which is 49, the answer is probably NO. The median of 25, however, tells you that at least half of the people living in this area are 25 or younger, so it will be wise to launch the app in this area!
    # Creating a simple vector of Age values with positive outliers
    Age <- c(21, 22, 22, 23, 25, 27, 28, 132, 144)
    mean(Age)
    median(Age)
    # Generating the density plot
    plot(density(Age), xlab = 'Age', ylab = 'Age Density')
    # Adding vertical lines for the mean (red) and the median (blue)
    abline(v = mean(Age), col = 'red', lwd = 2)
    abline(v = median(Age), col = 'blue', lwd = 2)
Data Science Tip: Missing Value Treatment for Machine Learning
While preparing data for machine learning, one critical step is to find and replace the missing values.
Using the mean can introduce bias in the data, since the mean does not always represent the central tendency or the general pattern of the data, especially when outliers are present.
In such cases, we should fill the missing values in each numeric column with that column's median, which is a better indicator of central tendency.
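A minimal sketch of median imputation in base R is shown here. The data frame df, its column names, and its values are made up purely for illustration; a real dataset would take their place.

```r
# Hypothetical data: two numeric columns with missing values and one outlier each
df <- data.frame(
  Age    = c(21, 22, NA, 23, 25, 27, 28, 132, NA),
  Salary = c(30, 32, 31, NA, 35, 33, 400, 34, 36)
)

# Replace the NA values in each numeric column with that column's median
numeric_cols <- names(df)[sapply(df, is.numeric)]
for (col in numeric_cols) {
  col_median <- median(df[[col]], na.rm = TRUE)   # ignore NA when computing
  df[[col]][is.na(df[[col]])] <- col_median
}

df
```

Here the missing Age values become 25 and the missing Salary value becomes 33.5, both close to the bulk of the data. Using the mean instead would fill in values pulled toward the outliers 132 and 400.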
Conclusion:
- The median should be used as the measure of average when analysing typical values or trends
- The median is not affected by outliers in the data
- Missing values should be imputed with the median rather than the mean