Stats 101: Measures of Spread [Standard Deviation]

Standard Deviation

Standard deviation helps us understand, on average, how far each data point is from the mean value.

Let us take a very simple example to understand what exactly the standard deviation means.
Consider the first six natural numbers: 1, 2, 3, 4, 5, 6.

The mean of these numbers is (1+2+3+4+5+6)/6 = 3.5
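As a quick check, the mean can be reproduced in a few lines of Python. This is a minimal sketch; the variable names `numbers` and `mean` are chosen here for illustration only.

```python
numbers = [1, 2, 3, 4, 5, 6]  # the first six numbers from the example

# Mean = sum of the values divided by how many values there are
mean = sum(numbers) / len(numbers)
print(mean)  # 3.5
```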

Standard deviation is a measure that tells us, on average, how far each of these numbers is from the mean value.


Let’s calculate the distance of each number from the mean.

Distance of 1 from mean = 1 – 3.5 = –2.5
Distance of 2 from mean = 2 – 3.5 = –1.5
Distance of 3 from mean = 3 – 3.5 = –0.5
So on and so forth…

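Now the average distance will be the sum of all the distances divided by six.
However, if you take a close look at this expression, it evaluates to zero. The negative distances of the values below the mean exactly cancel the positive distances of the values above it, and this happens for any data set, not just this one.

Average distance of the first six numbers from the mean = [(1 – 3.5) + (2 – 3.5) + (3 – 3.5) + (4 – 3.5) + (5 – 3.5) + (6 – 3.5)] / 6 = 0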
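The cancellation can be seen directly by computing the signed distances. The sketch below assumes the same six numbers; the names `distances` and `average_distance` are illustrative.

```python
numbers = [1, 2, 3, 4, 5, 6]
mean = sum(numbers) / len(numbers)  # 3.5

# Signed distance of each number from the mean
distances = [x - mean for x in numbers]
print(distances)  # [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]

# The negative and positive distances cancel, so the average is zero
average_distance = sum(distances) / len(distances)
print(average_distance)  # 0.0
```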

Hence, if we square the differences, no value is negative any more, and averaging them gives us the average squared distance of each number from the mean. This is also known as the variance.

Variance
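Variance of the first six numbers = (6.25 + 2.25 + 0.25 + 0.25 + 2.25 + 6.25) / 5 = 17.5 / 5 = 3.5

Note that 3.5 is the sample variance, which divides the sum of squared distances by n – 1 = 5. Dividing by n = 6 instead gives the population variance, which is about 2.92.

Variance tells you the average squared distance of each value from the mean.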
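The value 3.5 corresponds to dividing the sum of squared distances by n – 1 (the sample variance). The sketch below computes both the sample and the population variance by hand and compares them with Python's built-in `statistics` module; the variable names are illustrative.

```python
import statistics

numbers = [1, 2, 3, 4, 5, 6]
mean = statistics.mean(numbers)  # 3.5

# Squared distance of each number from the mean
squared_distances = [(x - mean) ** 2 for x in numbers]
total = sum(squared_distances)  # 17.5

# Sample variance divides by n - 1; population variance divides by n
sample_variance = total / (len(numbers) - 1)
population_variance = total / len(numbers)

print(sample_variance)                 # 3.5
print(round(population_variance, 2))   # 2.92

# The statistics module offers both flavours directly
print(statistics.variance(numbers))    # sample variance
print(statistics.pvariance(numbers))   # population variance
```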

But our intention was to find the average distance from the mean, not the average squared distance. We get back to the original units by taking the square root of the variance.

Standard deviation is the square root of the variance

Standard deviation is also abbreviated as ‘std dev’. Applying the std dev formula to our existing example:

Standard deviation of the first six numbers = √3.5 ≈ 1.87

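Hence, for the first six numbers, we can say that each number is typically about 1.87 units away from the mean value of 3.5. Strictly speaking, standard deviation is a root-mean-square distance, so it gives large distances more weight than a plain average would.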
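The 1.87 figure can be checked with the `statistics` module. This sketch assumes the sample standard deviation (n – 1 in the denominator) is intended, since that is what reproduces 1.87; the population version is shown for comparison.

```python
import math
import statistics

numbers = [1, 2, 3, 4, 5, 6]

# Sample standard deviation: square root of the sample variance (3.5)
sample_std = statistics.stdev(numbers)
print(round(sample_std, 2))  # 1.87

# Same result by taking the square root of the variance explicitly
print(round(math.sqrt(statistics.variance(numbers)), 2))  # 1.87

# Population standard deviation divides by n instead of n - 1
population_std = statistics.pstdev(numbers)
print(round(population_std, 2))  # 1.71
```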

How do I use it for my analysis?

A low standard deviation means the data points are close to each other and do not vary much.
A high standard deviation suggests that either the data points are scattered or the data set contains extremely low or extremely high values, also known as outliers.

The same is the case for the variance. Low variance means the data points are close to each other, and high variance means there are values that are far away from each other. You might have often heard the terms “high variance” and “low variance” in stock markets. A stock with high variance is one whose price fluctuates a lot, and it might be a small-cap stock. Investing in such stocks could be risky.

What is the ideal value of the standard deviation?

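There is no universally ideal value for the standard deviation, because it depends on the scale and units of the data. The value 1 is a useful reference point only: it is the standard deviation of the hypothetical standard normal distribution, where the mean value is zero and the standard deviation is 1.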
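One place where a standard deviation of exactly 1 shows up in practice is after standardizing data (subtracting the mean and dividing by the standard deviation). The sketch below is an illustration under that assumption, reusing the six numbers from the earlier example; `z_scores` is an illustrative name.

```python
import statistics

numbers = [1, 2, 3, 4, 5, 6]
mean = statistics.mean(numbers)
std = statistics.stdev(numbers)

# Standardize: shift by the mean, then scale by the standard deviation
z_scores = [(x - mean) / std for x in numbers]

# After standardizing, the mean is (numerically) 0 and the std dev is 1
z_mean = statistics.mean(z_scores)
z_std = statistics.stdev(z_scores)
print(z_mean, z_std)
```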

High standard deviation example #1

Outliers increase both the standard deviation and the variance. This happens because the outlier pulls the mean value towards it, so most values end up farther from the mean, and the outlier itself adds a very large squared distance.

Outlier increases the distance from mean for most of the values

Outliers increase the value of standard deviation
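The effect of a single outlier can be sketched as follows. The data here is hypothetical: the six numbers from the earlier example, with the last value replaced by 60 to act as an outlier.

```python
import statistics

without_outlier = [1, 2, 3, 4, 5, 6]
with_outlier = [1, 2, 3, 4, 5, 60]  # hypothetical outlier replaces 6

# The outlier pulls the mean towards it
mean_without = statistics.mean(without_outlier)
mean_with = statistics.mean(with_outlier)
print(mean_without, mean_with)  # 3.5 12.5

# As a result the standard deviation grows sharply
std_without = statistics.stdev(without_outlier)
std_with = statistics.stdev(with_outlier)
print(round(std_without, 2), round(std_with, 2))  # 1.87 23.31
```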

High standard deviation example #2

Another reason the standard deviation goes up is scattered values.

A simple way to recognize this is to check whether the values are far away from each other. In such a case, the mean value will be far away from most of the values, and hence the average distance will increase.
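To illustrate, the sketch below compares two hypothetical data sets that share the same mean of 50 but differ in how spread out the values are. Neither contains an outlier; only the scatter differs.

```python
import statistics

clustered = [48, 49, 50, 51, 52]   # hypothetical: values close together
scattered = [10, 30, 50, 70, 90]   # hypothetical: values far apart

# Both data sets have the same mean
print(statistics.mean(clustered), statistics.mean(scattered))  # 50 50

# The scattered set has a much larger standard deviation
std_clustered = statistics.stdev(clustered)
std_scattered = statistics.stdev(scattered)
print(round(std_clustered, 2), round(std_scattered, 2))  # 1.58 31.62
```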

Data Science Tip: How to use standard deviation?

  1. Data exploration: If the standard deviation of a feature/column is high relative to its typical values, check for outliers in the data by looking at its distribution. Use a histogram to visualize the distribution.
  2. Feature selection: A high standard deviation for a feature/column may point to a skewed distribution or outliers, and the feature may not be useful without treatment.
  3. During model fitting: If the standard deviation of the residuals is high, check for possible outliers/business anomalies in the training data.
  4. During predictions: If the standard deviation of the errors is high, the predictions are far off in many places.
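Tips 3 and 4 can be sketched in code as follows. The `actual` and `predicted` values are made up for illustration, and the rule of flagging residuals more than two standard deviations from the mean residual is an assumed rule of thumb, not something prescribed above.

```python
import statistics

# Hypothetical actual values and model predictions
actual =    [10, 12, 15, 11, 14, 13, 16, 40]
predicted = [ 9, 13, 15, 10, 15, 13, 15, 20]

# Residual = actual - predicted
residuals = [a - p for a, p in zip(actual, predicted)]
print(residuals)  # [1, -1, 0, 1, -1, 0, 1, 20]

mean_residual = statistics.mean(residuals)
std_residual = statistics.stdev(residuals)
print(round(std_residual, 2))  # 7.07

# Assumed rule of thumb: flag residuals more than 2 std devs from the mean
flagged = [
    i for i, r in enumerate(residuals)
    if abs(r - mean_residual) > 2 * std_residual
]
print(flagged)  # [7]
```

Here the last row dominates the residual standard deviation, so it would be the first place to look for an outlier or anomaly.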

Conclusion:

  • Standard deviation tells you, on average, how far each value is from the mean value.
  • Standard deviation is the square root of the variance.
  • There is no universally ideal standard deviation; a value of 1 is the standard deviation of the standard normal distribution
  • Outliers increase the value of the standard deviation

Author Details
Lead Data Scientist
Farukh is an innovator in solving industry problems using Artificial intelligence. His expertise is backed with 10 years of industry experience. Being a senior data scientist he is responsible for designing the AI/ML solution to provide maximum gains for the clients. As a thought leader, his focus is on solving the key business problems of the CPG Industry. He has worked across different domains like Telecom, Insurance, and Logistics. He has worked with global tech leaders including Infosys, IBM, and Persistent systems. His passion to teach inspired him to create this website!
