It is very important to explore the data and understand it before you can use it for solving problems.
The most intuitive way to explore data is to look at its spread, that is, how the data points are distributed. You can start by asking the questions below.
Question: What are the minimum and maximum values? Do these values seem sensible?
Answer: Min, Max, Range
Question: Are the data points close to each other, or is each one far away from the others?
Answer: Standard deviation and Variance
Let us look at each of these measures in detail and see how we can use them to perform very quick data validation.
Min, Max, Range
Min- The smallest value present in the data.
Max- The largest value present in the data.
Range- The minimum and maximum values present in the data, taken together, are known as the Range of the data. As a single number, the range is often reported as the difference Max - Min.
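As a minimal sketch, these three measures can be computed with Python's built-in `min` and `max` functions. The `ages` list is made-up illustration data, and the range is shown both as the (min, max) pair and as the single number max minus min.

```python
# Hypothetical sample data, for illustration only.
ages = [34, 28, 45, 50, 31, 39]

smallest = min(ages)               # Min: smallest value in the data
largest = max(ages)                # Max: largest value in the data
value_range = (smallest, largest)  # Range as the (min, max) pair
range_width = largest - smallest   # Range as a single number

print(f"Min = {smallest}, Max = {largest}")
print(f"Range = {value_range}, width = {range_width}")
```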

These values can be used to quickly check whether there are any data-capture errors.
For example, suppose your data for a stock trading company has an Age column, and its summary details show the following range of employee ages.
# Range of Age values of employees
Min(Age) = 10 Years
Max(Age) = 50 Years
Something is clearly wrong with this data, because the age of an employee at such a company would logically be above some threshold, say 25 years. Hence the minimum age cannot be 10 years! Appropriate steps should then be taken to correct the invalid data.
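A check like this can be sketched in a few lines of Python. The `ages` list, the `MIN_PLAUSIBLE_AGE` threshold of 25, and the helper `find_implausible` are all assumptions made for this example. The right threshold depends on the business context.

```python
# Hypothetical employee ages; the 10 mimics a data-capture error.
ages = [10, 27, 34, 41, 50, 29]

# Assumed business rule for this example: employees should be at least 25.
MIN_PLAUSIBLE_AGE = 25

def find_implausible(values, lower_bound):
    """Return (index, value) pairs for entries below lower_bound."""
    return [(i, v) for i, v in enumerate(values) if v < lower_bound]

suspects = find_implausible(ages, MIN_PLAUSIBLE_AGE)

print(f"Min(Age) = {min(ages)}, Max(Age) = {max(ages)}")
print("Rows to review:", suspects)
```

The minimum alone signals that something is off. The helper then shows which rows need to be corrected.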
Further, the min and max values give a sense of possible data issues. If the difference between the min and max values is very large, there may be outliers in the data.
For example, if the min value is 1 and the max value is 100,000, one of them may be an outlier, either on the lower side or on the higher side.
If there is a large difference between the Min and Max values, then there may be outliers in the data.
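One rough way to see which end is the suspect is to compare how far the min and the max lie from the median. Treating the median as the "typical" value is an assumption of this sketch, not a rule from the text, and the `values` list is invented for illustration.

```python
import statistics

# Hypothetical values: most lie between 1 and 10, one is 100,000.
values = [1, 3, 4, 5, 6, 8, 10, 100_000]

lo, hi = min(values), max(values)
typical = statistics.median(values)

# Distance of each extreme from the typical value.
gap_low = typical - lo
gap_high = hi - typical

suspect = "max" if gap_high > gap_low else "min"
print(f"Min = {lo}, Max = {hi}, median = {typical}")
print(f"The {suspect} value is farther from the median and worth checking.")
```

This only points at a candidate. Whether the value is really an outlier still needs a look at the data itself.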
Also, the range of a dataset helps us understand the magnitude of the spread. For example, consider the statements below:
“The salary of IT employees ranges from $50,000 to $100,000 per year.”
“The salary of Wall Street traders ranges from $250,000 to $350,000 per year.”
Both statements help us understand the magnitude and spread of typical salaries for the respective jobs.
However, a few questions come to mind when looking at these sample salary ranges.
Are these salaries evenly distributed?
Are there just a few with high salaries?
Are there just a few with low salaries?
These questions cannot be answered just by looking at the min and max values, because they do not tell us what kinds of values lie in between.
Range does not tell what type of values are present in between min and max
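To make this concrete, here is a small sketch with two made-up salary lists that share the same min and max but look very different in between. The lists and the helpers `value_range` and `count_above` are hypothetical and exist only for this illustration.

```python
# Two hypothetical salary lists (USD per year) with the same min and max.
evenly_spread = [50_000, 60_000, 70_000, 80_000, 90_000, 100_000]
mostly_low = [50_000, 51_000, 52_000, 53_000, 54_000, 100_000]

def value_range(values):
    """Return the (min, max) pair of a list of numbers."""
    return min(values), max(values)

def count_above(values, threshold):
    """Count how many values are strictly greater than threshold."""
    return sum(1 for v in values if v > threshold)

print(value_range(evenly_spread))  # (50000, 100000)
print(value_range(mostly_low))     # (50000, 100000)

# Same range, but very different numbers of salaries above 75,000.
print(count_above(evenly_spread, 75_000))  # 3
print(count_above(mostly_low, 75_000))     # 1
```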
To understand this, we need another measure of spread known as standard deviation, discussed in “Stats 101: Measures of Spread [Standard Deviation]”.
Data Science Tip: Data Exploration in Machine Learning
Looking at Range helps us in various stages of machine learning.
- During data exploration, we can spot anomalies just by looking at the min and max values. For example, age, salary, work experience, number of employees, etc. cannot be negative, so if you spot a negative minimum value in any of these, take data cleaning steps.
- While training a machine learning algorithm, we can spot the maximum error on the training data and take corrective steps by treating the bad rows, which are also known as outliers.
- Looking at the minimum and maximum errors in prediction helps us understand the predictive model’s performance. A large difference between the min and max error means there are some cases where the model is failing. Ideally, both the min and max error values should be as close to zero as possible.
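The last two points can be sketched as follows. The `actual` and `predicted` lists are invented numbers standing in for the output of some regression model, and the error is taken as prediction minus actual, which is one common convention but an assumption here.

```python
# Hypothetical actual and predicted values from some regression model.
actual = [200, 150, 320, 410, 275]
predicted = [195, 160, 318, 250, 280]

# Signed error for each row: prediction minus actual.
errors = [p - a for p, a in zip(predicted, actual)]

min_error = min(errors)
max_error = max(errors)

# Row with the largest absolute error, a candidate for closer inspection.
worst_row = max(range(len(errors)), key=lambda i: abs(errors[i]))

print("Errors:", errors)
print(f"Min error = {min_error}, Max error = {max_error}")
print(f"Largest absolute error at row {worst_row}: {errors[worst_row]}")
```

Here most errors are small, but the minimum error is far from zero, which points to one row where the model fails badly.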
Conclusion:
- Min and Max values are measures of spread
- Min and Max can be used to perform basic data validation
- The range of data helps to understand the magnitude of the spread
- The range does not tell what type of values are present in between min and max