What is the use of Sampling Theory in Data Science?

Sampling Theory helps you examine how well a predictive model will perform BEFORE it is deployed in production.

What is the Sampling Theory?

Sampling means choosing random rows from a dataset.

Sampling theory says that if you select rows at random, the selected subset tends to represent the whole data, provided enough rows are selected. A detailed explanation of sampling theory can be read here.

Consider the example below, in which the full data has 10 rows. Each row lists the displacement and the horsepower of an engine.

If you need to represent the full data by picking some sample rows, how many rows should be selected? A popular choice is 70% of the rows.

So a randomly selected 70% of the rows is picked as Training Data, and the remaining 30% of the rows, which were not picked for Training Data, is known as Testing Data.

Splitting the full data into Training and Testing based on random sampling

Why 70:30? Can I select 80:20?

Yes! 70:30, 80:20 and 75:25 are all acceptable ratios of training to testing data. The idea is to select ENOUGH rows in the training data to cover all the types of patterns in the data for the model to learn, while the testing data still holds enough rows to test that learning.
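To make the ratios concrete, here is a minimal sketch of how many rows each one puts in training and testing. It assumes the built-in mtcars dataset (32 rows) as the full data; the helper name SplitSizes is made up for this illustration.

```r
# Hypothetical helper: row counts for a given training fraction
SplitSizes <- function(TotalRows, TrainFraction) {
  # floor() keeps the training size a whole number of rows
  TrainRows <- floor(TrainFraction * TotalRows)
  c(Train = TrainRows, Test = TotalRows - TrainRows)
}

# Assumption: mtcars (ships with R) plays the role of the full data
TotalRows <- nrow(mtcars)

# Compare the three ratios discussed above
for (TrainFraction in c(0.70, 0.75, 0.80)) {
  Sizes <- SplitSizes(TotalRows, TrainFraction)
  cat(sprintf("%.0f:%.0f split -> %d training rows, %d testing rows\n",
              100 * TrainFraction, 100 * (1 - TrainFraction),
              Sizes["Train"], Sizes["Test"]))
}
```

On a dataset this small, the difference between the ratios is only a handful of rows, which is one reason none of them is considered the single correct choice.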

What is Training data?

Training data is the major part (usually 70%) of the full data, selected at random. This chunk of data is used to train the predictive model. These are the examples that the predictive model uses to learn the patterns.

What is Testing data?

Testing data is the part (the remaining 30%) of the full data which was NOT selected for Training data. This chunk of data is used to TEST the predictive model for its performance.

These are the examples that are UNSEEN by the predictive model. Hence, these are used to test the accuracy by comparing the predicted values with the original values.

Why don’t we test the model on Training data?

Because those examples are already SEEN by the model. Hence, the predictions on them tend to look more accurate than they really are! The real test of the model is when it predicts based on inputs which are UNSEEN.
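The gap between SEEN and UNSEEN rows can be measured directly. The sketch below is only an illustration under stated assumptions: a simple linear model (lm) predicting hp from disp on mtcars stands in for "the predictive model", the RMSE helper is hypothetical, and the seed value is arbitrary.

```r
# Assumption: a linear model of hp on disp stands in for the predictive model
set.seed(42)  # arbitrary seed so the split is repeatable

FullData <- mtcars[, c('disp', 'hp')]

# 70% of the rows for training, the remaining rows for testing
RandomIndex <- sample(1:nrow(FullData), size = floor(0.7 * nrow(FullData)))
TrainData <- FullData[RandomIndex, ]
TestData <- FullData[-RandomIndex, ]

# The model only ever sees the training rows
Model <- lm(hp ~ disp, data = TrainData)

# Hypothetical helper: root mean squared error (lower is better)
RMSE <- function(Actual, Predicted) sqrt(mean((Actual - Predicted)^2))

TrainError <- RMSE(TrainData$hp, predict(Model, newdata = TrainData))
TestError <- RMSE(TestData$hp, predict(Model, newdata = TestData))

cat("RMSE on SEEN training rows  :", round(TrainError, 2), "\n")
cat("RMSE on UNSEEN testing rows :", round(TestError, 2), "\n")
```

On a single small split the two numbers can land either way round by chance, but the error on the testing rows is the one that reflects how the model handles inputs it has not seen.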

Math Exam in School Days!

Let’s take the example of a maths exam in school days.
To appear in the maths exam and score well, you practiced a lot of sums. But the sum that was asked in the exam was slightly different from all of the sums you had seen before.

If you had solved a similar sum before, you will be able to crack the one asked in the exam.

If you never solved a similar sum before, you will fail to solve that sum asked in the exam.

Mathematical understanding is tested by making you solve a slightly different sum with the same underlying concept.

Similarly, predictive models are trained on training data and then tested for accuracy on testing data to check whether the model has learned the patterns well. This works because we already know the answers for the testing data, while the predictive model does not.

Before a predictive model goes live, it is important to check whether the model has understood the patterns in the data in a general way.

If the model sees an input that is similar to one of the inputs in the training data, the prediction is likely to be accurate; otherwise it is likely to be poor.

What if the selected sample does not represent full data?

It is fair to assume that a random sample may sometimes fail to be a true representative of the full data. Some of the patterns may be missed.

This is why one never relies on the accuracy results based on one single round of sampling-training-testing.
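A quick way to see this is to draw several random samples and compare a summary of each one against the full data. The sketch below is illustrative only: it assumes mtcars as the full data, uses the mean of hp as the summary, and fixes an arbitrary seed.

```r
# Assumption: mtcars plays the role of the full data
FullData <- mtcars[, c('disp', 'hp')]
FullMean <- mean(FullData$hp)

# Draw five independent 70% samples and record the mean hp of each
set.seed(1)  # arbitrary seed for repeatability
SampleMeans <- sapply(1:5, function(Round) {
  RandomIndex <- sample(1:nrow(FullData), size = floor(0.7 * nrow(FullData)))
  mean(FullData$hp[RandomIndex])
})

cat("Mean hp of the full data :", round(FullMean, 1), "\n")
cat("Mean hp of each sample   :", round(SampleMeans, 1), "\n")
```

The sample means scatter around the full-data mean rather than matching it exactly, so a result based on any one of these samples carries some luck, good or bad.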

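Performing multiple rounds of sampling makes it far more likely that all types of patterns are covered. This article calls the technique Bootstrapping. Strictly speaking, classical bootstrapping draws rows with replacement; repeated random train/test splits as described here are also known as repeated hold-out or Monte Carlo cross-validation.

If you perform 5 rounds of sampling it is known as 5-step Bootstrapping; if you perform 10 rounds of sampling it is known as 10-step Bootstrapping.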

Performing multiple rounds of sampling-training-testing is known as n-step Bootstrapping

Before deploying the predictive model in the production environment, it is thoroughly checked by simulating the real-world scenario multiple times. This means following the flow below:

1. Randomly sample a fresh set of training data and testing data from the full data
2. Create the predictive model on the training data
3. Test the accuracy of the predictive model on the testing data
4. Repeat steps 1 to 3 at least 5 times (5-step Bootstrapping)
5. Take the final accuracy as the average of the accuracy values from all sampling rounds
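The flow above can be sketched in R roughly as follows. This is an illustration, not a definitive implementation: it assumes mtcars as the full data, a linear model (lm) of hp on disp as the predictive model, and a made-up AccuracyPercent helper defined as 100 minus the mean absolute percentage error. The seed is arbitrary.

```r
set.seed(123)  # arbitrary seed so the rounds are repeatable

FullData <- mtcars[, c('disp', 'hp')]
NumRounds <- 5  # number of sampling-training-testing rounds

# Hypothetical accuracy measure for a numeric target:
# 100 minus the mean absolute percentage error
AccuracyPercent <- function(Actual, Predicted) {
  100 - mean(abs(Actual - Predicted) / Actual) * 100
}

RoundAccuracy <- numeric(NumRounds)

for (Round in 1:NumRounds) {
  # Fresh random 70:30 split of the full data
  RandomIndex <- sample(1:nrow(FullData), size = floor(0.7 * nrow(FullData)))
  TrainData <- FullData[RandomIndex, ]
  TestData <- FullData[-RandomIndex, ]

  # Create the model on the training data only
  Model <- lm(hp ~ disp, data = TrainData)

  # Measure accuracy on the testing data
  Predictions <- predict(Model, newdata = TestData)
  RoundAccuracy[Round] <- AccuracyPercent(TestData$hp, Predictions)
}

# Final accuracy is the average over all rounds
FinalAccuracy <- mean(RoundAccuracy)

cat("Accuracy in each round :", round(RoundAccuracy, 1), "\n")
cat("Final average accuracy :", round(FinalAccuracy, 1), "\n")
```

Looking at the spread of the per-round values alongside their average gives a sense of how much a single round could have misled you.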

How to perform sampling in R for Data Science

This is a critical step in machine learning: you select the examples (rows) from the data which the algorithm will use to learn the patterns, and the result is a predictive model.

Below is the R code snippet to perform sampling on any data for machine learning.

# Creating Full Data using the first 10 rows with disp and hp columns of the mtcars dataset
FullData <- mtcars[1:10, c('disp', 'hp')]

# Printing the displacement and horsepower of 10 cars
print(FullData)

# Creating random index for training data (70% of total rows)
# floor() keeps the sample size a whole number when 70% of the rows is fractional
# Every time the below snippet is run, it generates a different set of index values. #Randomness!
RandomIndex <- sample(1:nrow(FullData), size = floor(0.7 * nrow(FullData)))
print(RandomIndex)

# Creating Training data by choosing random index rows
TrainData <- FullData[RandomIndex, ]
print(TrainData)

# Creating Testing data by choosing those rows which are not in Training Data
TestData <- FullData[-RandomIndex, ]
print(TestData)
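The snippet above gives a different split on every run. If you need a repeatable split, for example to compare two models on exactly the same rows, one option is to fix the random seed before sampling. A minimal sketch follows; the seed value 2024 and the names FirstDraw and SecondDraw are arbitrary choices for this illustration.

```r
FullData <- mtcars[1:10, c('disp', 'hp')]

# Fixing the seed makes sample() return the same index values on every run
set.seed(2024)
FirstDraw <- sample(1:nrow(FullData), size = floor(0.7 * nrow(FullData)))

# Resetting the same seed reproduces the same draw
set.seed(2024)
SecondDraw <- sample(1:nrow(FullData), size = floor(0.7 * nrow(FullData)))

print(identical(FirstDraw, SecondDraw))  # TRUE

# Training and testing rows never overlap and together cover the full data
TrainData <- FullData[FirstDraw, ]
TestData <- FullData[-FirstDraw, ]
print(nrow(TrainData) + nrow(TestData) == nrow(FullData))  # TRUE
```

Leave the seed out when the goal is to run several independent rounds of sampling, since each round should then draw a fresh split.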

Output:

Randomly selected index

Conclusion

Sampling is needed in data science to judge the predictive model’s performance before deploying it in production
Data is split into two parts with a ratio like 70:30, 80:20 or 75:25 by selecting random rows
The bigger part is called Training data. It is used to train the predictive model
The smaller part is called Testing data. It is used to test the predictive model

Author Details

Farukh Hashmi

Lead Data Scientist

Farukh is an innovator in solving industry problems using Artificial intelligence. His expertise is backed with 10 years of industry experience. Being a senior data scientist he is responsible for designing the AI/ML solution to provide maximum gains for the clients. As a thought leader, his focus is on solving the key business problems of the CPG Industry. He has worked across different domains like Telecom, Insurance, and Logistics. He has worked with global tech leaders including Infosys, IBM, and Persistent systems. His passion to teach inspired him to create this website!

https://thinkingneuron.com/

thinkingneuron@gmail.com