Stats 101: Measures of Location [Quartiles]

Measures of location are summary values that describe how a dataset is distributed. Quartiles are one such measure: three values that divide the sorted data into four equal parts.

In order to understand how the data is distributed, we can arrange the data in increasing order and look at the values at certain points.

For example, let us consider a set of values and find quartiles for it.

1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
Quartile1 (Q1): 3.5
Quartile2 (Q2): 6
Quartile3 (Q3): 8.5

Definition of Quartiles:

If you want to divide this data into four equal parts, you have to choose three cut points. These three points are known as quartiles.

How to determine quartiles?

Sort the data in increasing order.
Find the point that divides the data into two equal halves. This is the median, which is the second quartile, abbreviated as Q2.
Take the first half of the data (counting the median itself when the number of values is odd) and find the median of this subset. This is the first quartile, abbreviated as Q1.
Take the second half of the data (again counting the median when the number of values is odd) and find the median of this subset. This is the third quartile, abbreviated as Q3.
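The steps listed here can be sketched in a few lines of Python. This is only an illustration: the function name quartiles_by_halves is made up for this example, and it assumes that the median is counted in both halves when the number of values is odd, which is the convention that matches the values 3.5 and 8.5 quoted for the example data. Other conventions exist and can give slightly different answers.

```python
from statistics import median


def quartiles_by_halves(values):
    """Return (Q1, Q2, Q3) using the sort / median / median-of-halves steps."""
    data = sorted(values)              # Step 1: sort in increasing order
    n = len(data)
    if n < 2:
        raise ValueError("need at least two values")

    q2 = median(data)                  # Step 2: the median is Q2

    # Assumption: for an odd count the middle value belongs to both halves.
    half = (n + 1) // 2
    lower_half = data[:half]
    upper_half = data[n - half:]

    q1 = median(lower_half)            # Step 3: median of the first half
    q3 = median(upper_half)            # Step 4: median of the second half
    return q1, q2, q3


print(quartiles_by_halves(range(1, 12)))
```

For the numbers 1 to 11 the two halves are 1 to 6 and 6 to 11, so the function returns 3.5, 6 and 8.5.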

Let us walk through these steps on the example data.

Step-1: Arrange the data points in increasing order and find the point that divides them into two equal halves.

In this case, it turns out to be 6, since it divides the data into two halves with five numbers on each side. Hence Q2 is 6 for this data, also known as the median value.

Step-2: Consider the first half of the data and find its median value.

In this scenario, the first half is the first six numbers (1 to 6, counting the median 6 itself), and their median is 3.5. Hence Q1 is 3.5.

Step-3: Consider the second half of the data and find its median value.

In this scenario, the second half is the last six numbers (6 to 11, again counting the median 6), and their median is 8.5. Hence Q3 is 8.5.

Collectively, Q1, Q2, and Q3 are known as quartiles. These points divide the data into four equal parts.

Why are Quartiles important?

Quartiles are great measures of location. They give you a bird’s-eye view of the distribution of a dataset.

Potential outliers are usually found below Q1 or above Q3, because this is where extremely low or extremely high values lie.

The summary of any numeric dataset consists of the six most important points listed below:

Min, Max, Mean, Q1, Q2, Q3

Together these values help you to understand how the data is distributed.

You can spot an abnormality in data easily by comparing the mean and the median. If they are equal, the data is likely to be roughly symmetric. If there is a large gap between the mean and the median, the data is skewed and outliers may be present.
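A small sketch of this check, using Python's built-in statistics module. The second list is hypothetical data made up for this illustration: it is the example data with the last value replaced by an extreme one.

```python
from statistics import mean, median

# The example data: evenly spread, no extreme values
balanced = list(range(1, 12))

# Hypothetical data: same as above, but the last value is an outlier
skewed = list(range(1, 11)) + [100]

for name, data in [("balanced", balanced), ("skewed", skewed)]:
    gap = mean(data) - median(data)
    print(name, "mean:", round(mean(data), 2), "median:", median(data), "gap:", round(gap, 2))
```

The mean and median agree for the first list. In the second list the single extreme value pulls the mean upward while the median stays at 6.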

Similarly, if there is a large gap between Q1 and Q3, the data is widely scattered and is likely to have a high variance and a high standard deviation.

Inter Quartile Range

If you need a dataset free of outliers in one go, you can simply keep the data between Q1 and Q3, which is the middle 50% of the overall data.

The difference between Q3 and Q1 is known as the Inter Quartile Range, abbreviated as ‘IQR’.

In our example Q1 = 3.5 and Q3 = 8.5, hence the IQR is 8.5 - 3.5 = 5.
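The same arithmetic can be sketched in Python. This example assumes Python 3.8 or later, where statistics.quantiles is available, and it uses method="inclusive" because that setting reproduces the quartile values used in this article. The variable names are only illustrative.

```python
from statistics import quantiles

data = list(range(1, 12))

# n=4 asks for the three cut points that split the data into four parts
q1, q2, q3 = quantiles(data, n=4, method="inclusive")

# Inter Quartile Range: distance between the third and first quartiles
iqr = q3 - q1

# Keep only the values that lie between Q1 and Q3
middle_half = [x for x in data if q1 <= x <= q3]

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("Values between Q1 and Q3:", middle_half)
```

With only 11 points the values between Q1 and Q3 are 4 to 8, which is roughly, not exactly, half of the data.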

What if I divide the data into more parts?

Yes, it can be done. If you divide the data into more parts, each cut point becomes a measure of location. Two common special cases are listed below.

Deciles: Divide the data into 10 equal parts
Percentiles: Divide the data into 100 equal parts
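As a rough sketch, both cases can be computed with the same standard-library function used for quartiles, only with a different number of parts. This assumes Python 3.8 or later and the "inclusive" method. The variable names are illustrative.

```python
from statistics import quantiles

data = list(range(1, 12))

# n=10 returns the 9 cut points that split the data into 10 parts
deciles = quantiles(data, n=10, method="inclusive")

# n=100 returns the 99 cut points that split the data into 100 parts
percentiles = quantiles(data, n=100, method="inclusive")

print("Deciles:", deciles)
print("25th percentile:", percentiles[24])   # index 24 holds the 25th cut point
print("75th percentile:", percentiles[74])   # index 74 holds the 75th cut point
```

The 25th, 50th and 75th cut points in the percentile list are the same points as Q1, Q2 and Q3.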

Data Science Tip: How to use quartiles?

Data exploration: Look at the difference between Q1 and Q3 to understand whether the data is scattered. A large difference between Q1 and Q3 means the data is widely spread.
Residuals/Errors of Predictions: The errors on test data give a clue about the performance of your predictive model. Look at the distribution of errors. Typically the mean will be higher than the median because of outliers. The positions of Q1 and Q3 show how the middle 50% of all the errors are distributed.
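Here is a minimal sketch of the second tip. The list of absolute errors is hypothetical and invented only for illustration. In practice it would come from comparing your model's predictions with the actual values on test data.

```python
from statistics import mean, quantiles

# Hypothetical absolute prediction errors; the last one is a large miss
abs_errors = [1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 40]

q1, q2, q3 = quantiles(abs_errors, n=4, method="inclusive")
mean_error = mean(abs_errors)

print("Q1:", q1, "Median:", q2, "Q3:", q3)
print("Mean error:", round(mean_error, 2))
```

In this made-up case the middle half of the errors lies between 2.5 and 5, while the single large miss drags the mean above Q3. This is the kind of gap between mean and median that the tip describes.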

How to find quartiles in R?

R has a function called ‘quantile()’ which helps to find the quartile values for a dataset. Look at the example below

# Create the first 11 whole numbers
DataPoints <- 1:11

# Find the quartiles (by default quantile() returns the 0%, 25%, 50%, 75% and 100% points)
quantile(DataPoints)

# Find the quartiles along with the other summary statistics
summary(DataPoints)

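Output in R: quantile() reports 1, 3.5, 6, 8.5 and 11 for the 0%, 25%, 50%, 75% and 100% points, and summary() reports the same values along with the mean, which is 6.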

How to find quartiles in Python?

Python has multiple functions for finding quartiles. I recommend using the describe() method from the pandas library, since most data pre-processing is done with pandas.

# Importing the numpy and pandas libraries
import numpy as np
import pandas as pd

# Creating an array of 11 points (1 to 11)
DataPoints = np.arange(1, 12)
print(DataPoints)

# Creating a DataFrame with one column
SampleDataFrame = pd.DataFrame(DataPoints, columns=['DataPoints'])
print(SampleDataFrame)

# Finding quartiles (the 25%, 50% and 75% rows) and other summary statistics
print(SampleDataFrame['DataPoints'].describe())

Conclusion:

Quartiles divide the data into four equal parts by finding three points: Q1, Q2, and Q3.
Q2 is also known as the median.
The difference between Q3 and Q1 is known as the Inter Quartile Range (IQR).
Min, Max, Mean, Q1, Q2, and Q3 together help you understand how the data is distributed.

Author Details

Farukh Hashmi

Lead Data Scientist

Farukh is an innovator in solving industry problems using Artificial intelligence. His expertise is backed with 10 years of industry experience. Being a senior data scientist he is responsible for designing the AI/ML solution to provide maximum gains for the clients. As a thought leader, his focus is on solving the key business problems of the CPG Industry. He has worked across different domains like Telecom, Insurance, and Logistics. He has worked with global tech leaders including Infosys, IBM, and Persistent systems. His passion to teach inspired him to create this website!

https://thinkingneuron.com/

thinkingneuron@gmail.com