Stats 101: Why Median is a better measure of central tendency

When you want to understand the central tendency of a numeric dataset, the median is often a better choice than the mean, for example when answering questions like these:

1. What is the average age of people living in an area?
2. What is the average placement salary of students from an institute?

The median is the better average to use here. The mean takes outliers into account, so a few extreme values can pull it away from the typical value and mislead you into wrong decisions. The median performs better because of the way it is calculated.

Median divides the data into two equal halves

Let us understand how the median is calculated.
Step-1: Arrange the data in increasing order
Step-2: If the number of values is odd, take the middle value, which divides the data into two equal parts
Step-3: If the number of values is even, take the mean of the two middle values
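
The steps above can be sketched by hand in R. This is only an illustration: manual_median is a hypothetical helper name that does not appear elsewhere in this article, and it assumes a numeric vector with at least one value. In practice the built-in median() function does the same job.

```r
# Hypothetical helper (an assumption for illustration, not a built-in function)
manual_median <- function(x) {
  x <- sort(x)                      # Step-1: arrange the data in increasing order
  n <- length(x)
  if (n %% 2 == 1) {
    x[(n + 1) / 2]                  # Step-2: odd count, take the single middle value
  } else {
    mean(x[c(n / 2, n / 2 + 1)])    # Step-3: even count, mean of the two middle values
  }
}

manual_median(c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11))
manual_median(c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
```

Both calls should agree with median() on the same vectors.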

Understanding the calculation of median in R

Example-1 (Odd number of values): the median of the numbers 1:11 is 6. There are 5 values on the left side and 5 values on the right side.

# odd number of values
DataPoints=c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)

median(DataPoints)

Example-2 (Even number of values): the median of the numbers 1:10 is 5.5, calculated as (5+6)/2. Once again there are 5 values on the left side and 5 values on the right side.

# Even number of values
DataPoints=c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

median(DataPoints)

Median is not affected by outliers

Due to the way the median is calculated, it is not pulled toward outliers in the data.
For example, consider the data points below, which include an outlier.

1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100

The median value is 6, which divides the whole dataset into two equal halves: five values are on the left side and five values are on the right side.
If we calculate the mean for the same dataset, it comes out to about 14.09 because of the outlier 100.
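
To check these numbers, here is a small sketch comparing the mean and the median with and without the outlier. The variable names WithoutOutlier and WithOutlier are chosen only for this illustration; the first vector is the 1:11 data from Example-1 and the second replaces 11 with the outlier 100.

```r
# Same data as Example-1, and the same data with 11 replaced by the outlier 100
WithoutOutlier <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
WithOutlier    <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100)

mean(WithoutOutlier)     # 66 / 11 = 6
median(WithoutOutlier)   # 6

mean(WithOutlier)        # 155 / 11, about 14.09
median(WithOutlier)      # still 6
```

A single extreme value more than doubles the mean, while the median stays at 6.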

Business Scenario:

You have developed a dating app
You want to run a campaign to advertise it at a location
The average age of people living in that area is 49
Will you launch the app?

Let us dig a little deeper by computing both the mean and the median.
Mean Age: 49 Years
Median Age: 25 Years
If you look only at the mean, which is 49, the answer is NO. But the median of 25 tells you that at least half of the people living in this area are aged 25 or younger, so it would be wise to launch the app in this area!

# Creating a simple vector of Age values with positive outliers
Age=c(21,22,22,23,25,27,28,132,144)

mean(Age)

median(Age)

# Generating the density plot
plot(density(Age), xlab='Age', ylab='Age Density')

# Adding vertical lines for the mean (red) and the median (blue)
abline(v=mean(Age),col='red', lwd=2)
abline(v=median(Age),col='blue', lwd=2)

Data Science Tip: Missing Value Treatment for Machine Learning

While preparing data for machine learning, one critical step is to find and replace the missing values.
Using the mean can introduce bias, since the mean does not always represent the central tendency or the general pattern of the data, especially when outliers are present.
In such cases, we should use the median value of each numeric column, which is a better indicator of central tendency.
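
As a minimal sketch of this idea, the code below fills the missing values in every numeric column with that column's median. The Employees data frame, its column names, and its values are all made up for illustration, and this is just one simple way to do it in base R.

```r
# Hypothetical data with missing values and one outlier per column (made up for illustration)
Employees <- data.frame(
  Age    = c(21, 22, NA, 23, 25, 27, 28, 132, NA),
  Salary = c(30, 32, 35, NA, 40, 41, 45, 500, 38)
)

# Names of the numeric columns
NumericCols <- names(Employees)[sapply(Employees, is.numeric)]

# Replace the missing values in each numeric column with that column's median
for (col in NumericCols) {
  ColMedian <- median(Employees[[col]], na.rm = TRUE)
  Employees[[col]][is.na(Employees[[col]])] <- ColMedian
}

# Count of missing values left in each column
colSums(is.na(Employees))
```

Note the na.rm = TRUE argument: without it, median() returns NA for a column that contains missing values.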

Conclusion:

The median should be used as the measure of the average when analyzing trends
The median is not affected by outliers in the data
Missing values should not be imputed with the mean; the median can be used instead

Author Details

Farukh Hashmi

Lead Data Scientist

Farukh is an innovator in solving industry problems using Artificial intelligence. His expertise is backed with 10 years of industry experience. Being a senior data scientist he is responsible for designing the AI/ML solution to provide maximum gains for the clients. As a thought leader, his focus is on solving the key business problems of the CPG Industry. He has worked across different domains like Telecom, Insurance, and Logistics. He has worked with global tech leaders including Infosys, IBM, and Persistent systems. His passion to teach inspired him to create this website!

https://thinkingneuron.com/

thinkingneuron@gmail.com
