Stats 101: Measures of Spread [Min Max]

Before you can use data to solve a problem, it is very important to explore it and understand it.

The most intuitive way to explore data is to look at its spread, that is, how the data points are distributed. You can start by asking the questions below.

Question: What are the minimum and maximum values? Do these values seem sensible?
Answer: Min, Max, Range

Question: Are the data points close to each other, or is each one far away from the others?
Answer: Standard deviation and Variance

Let us understand each of these measures in detail and see how we can use them to perform very quick data validation.

Min, Max, Range

Min- The smallest value present in the data.
Max- The largest value present in the data.
Range- The Minimum and Maximum values present in the data, taken together, are known as the Range of the data. It is also commonly reported as a single number: Max minus Min.
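As a minimal sketch, these three measures can be computed with Python built-ins. The `ages` list and the `describe_range` helper are made up for illustration; the `width` key holds Max minus Min.

```python
def describe_range(values):
    """Return the min, max, and width (max - min) of a non-empty sequence."""
    lowest = min(values)
    highest = max(values)
    return {"min": lowest, "max": highest, "width": highest - lowest}


# Hypothetical sample data, for illustration only
ages = [34, 27, 45, 50, 29, 38]
summary = describe_range(ages)
print(summary)
```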

These values can be used to quickly check whether there are any data-capture errors.

For example, suppose your data for a stock trading company has an Age column, and its summary details show the following range for the Age of employees.

# Range of Age values of employees
Min(Age)= 10 Years
Max(Age)= 50 Years

It is easy to see that something is wrong with this data, because the age of an employee should logically be above some threshold, say 25 Years. Hence the minimum age cannot be 10 Years! Appropriate steps can then be taken to correct the invalid data.
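A check like this could be sketched as follows. The `ages` values, the `find_out_of_bounds` helper, and the upper bound of 70 are assumptions for illustration; only the lower threshold of 25 comes from the example above.

```python
def find_out_of_bounds(values, lower, upper):
    """Return (index, value) pairs for entries outside [lower, upper]."""
    return [(i, v) for i, v in enumerate(values) if v < lower or v > upper]


# Hypothetical Age column; the bounds are assumptions, not fixed rules
ages = [32, 41, 10, 50, 28, 36]
MIN_PLAUSIBLE_AGE = 25  # threshold used in the example above
MAX_PLAUSIBLE_AGE = 70  # assumed upper bound, for illustration only

print("Min(Age)=", min(ages), "Max(Age)=", max(ages))
suspect_rows = find_out_of_bounds(ages, MIN_PLAUSIBLE_AGE, MAX_PLAUSIBLE_AGE)
print("Rows to review:", suspect_rows)
```

The helper returns row positions rather than just a yes/no answer, so the suspicious records can be reviewed before deciding how to correct them.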

Further, Min and Max values can hint at other data issues. If the difference between the min and max values is very large, there may be outliers present in the data.

For example, if the min value is 1 and the max value is 100,000, then one of them may be an outlier, either on the lower side or on the higher side.

If there is a large difference between the Min and Max values, there may be outliers in the data.
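One simple way to see which side a suspected outlier is on is to compare each extreme with its nearest neighbour. This heuristic is an assumption added for illustration, not a formal outlier test, and the `values` list and `extreme_gaps` helper are hypothetical.

```python
def extreme_gaps(values):
    """Return the gap between each extreme and its nearest neighbour.

    Expects at least two values.
    """
    ordered = sorted(values)
    low_gap = ordered[1] - ordered[0]     # min vs. second-smallest value
    high_gap = ordered[-1] - ordered[-2]  # max vs. second-largest value
    return low_gap, high_gap


# Hypothetical data with min = 1 and max = 100,000
values = [1, 3, 4, 6, 8, 100_000]
low_gap, high_gap = extreme_gaps(values)
print("Max - Min:", max(values) - min(values))
print("Gap at the low end:", low_gap, "Gap at the high end:", high_gap)
```

Here the gap at the high end is far larger than the gap at the low end, which suggests the maximum is the value worth investigating.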

Also, the Range of a dataset helps us understand the magnitude of the spread. For example, consider the statements below.

“The salary of IT employees ranges from $50,000 to $100,000 per year.”

“The salary of Wall Street traders ranges from $250,000 to $350,000 per year.”

Both statements help us understand the magnitude and spread of typical salaries for the respective jobs. One can see that traders earn higher salaries than IT employees!

However, a few questions come to mind when looking at these sample salary ranges.

Are these salaries evenly distributed?
Are there just a few with high salaries?
Are there just a few with low salaries?

These questions cannot be answered just by looking at the min and max values, because they do not tell us what kinds of values lie in between.

The Range does not tell us what kinds of values are present between the min and max.
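To illustrate this limitation, here is a small sketch with two made-up salary lists that share the same min and max but are distributed very differently. The data and the `count_below_midpoint` helper are hypothetical.

```python
def count_below_midpoint(values):
    """Count how many values fall below the midpoint of min and max."""
    midpoint = (min(values) + max(values)) / 2
    return sum(1 for v in values if v < midpoint)


# Two hypothetical salary lists with identical min and max
evenly_spread = [50_000, 60_000, 70_000, 80_000, 90_000, 100_000]
mostly_low = [50_000, 51_000, 52_000, 53_000, 54_000, 100_000]

for name, salaries in [("evenly_spread", evenly_spread), ("mostly_low", mostly_low)]:
    print(name, min(salaries), max(salaries), count_below_midpoint(salaries))
```

Both lists report the same range, yet half of the first list and all but one value of the second sit below the midpoint.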

To understand this, we need another measure of spread known as Standard Deviation, discussed in “Stats 101: Measures of Spread [Standard Deviation]”.

Data Science Tip: Data Exploration in Machine Learning

Looking at the Range helps us at various stages of machine learning.

During Data Exploration, we can spot anomalies just by looking at the min and max values. For example, age, salary, work experience, number of employees etc. cannot be negative, so if you spot a negative minimum value in any of these, take data cleaning steps.
While Training the machine learning algorithm, we can spot the maximum error on the training data and take corrective steps by treating the bad rows, which are also known as outliers.
Looking at the minimum and maximum errors in prediction helps us understand the predictive model’s performance. A large difference between the min and max error means there are some cases where the model is failing. Ideally, both min and max error values should be as close to zero as possible.
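The error check described in the last point could be sketched as below. The `actual` and `predicted` values and the `error_extremes` helper are hypothetical, and using absolute errors is an assumption, since the text does not say whether errors are signed.

```python
def error_extremes(actual, predicted):
    """Return the smallest and largest absolute prediction errors."""
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return min(errors), max(errors)


# Hypothetical targets and model predictions, for illustration only
actual = [100, 150, 200, 250]
predicted = [98, 155, 201, 400]

min_err, max_err = error_extremes(actual, predicted)
print("Min error:", min_err, "Max error:", max_err)
```

In this made-up case the minimum error is small while the maximum is very large, which points to at least one row where the model fails badly.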

Conclusion:

Min and Max values are measures of spread
Min and Max can be used to perform basic data validation
The range of data helps to understand the magnitude of the spread
The range does not tell us what kinds of values are present between the min and max

Author Details

Farukh Hashmi

Lead Data Scientist

Farukh is an innovator in solving industry problems using Artificial Intelligence. His expertise is backed by 10 years of industry experience. As a senior data scientist, he is responsible for designing AI/ML solutions that provide maximum gains for clients. As a thought leader, his focus is on solving the key business problems of the CPG Industry. He has worked across different domains like Telecom, Insurance, and Logistics. He has worked with global tech leaders including Infosys, IBM, and Persistent Systems. His passion for teaching inspired him to create this website!

https://thinkingneuron.com/

thinkingneuron@gmail.com
