How to test machine learning models using bootstrapping in Python

Before you put an ML model into production, it must be tested for accuracy. This is why we split the available data into training and testing sets, typically 70% for training and the remaining 30% for testing. The reasoning behind this split is explained in a separate post as well as in the video below.

This activity of splitting the data randomly is called sampling. When you measure the accuracy of a machine learning model, there is a chance that the sample you got was lucky: the accuracy may come out high because the random split happened to place rows in the testing data that are very similar to rows in the training data, so the model performs better than it would on genuinely new data.

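To rule out this luck factor, we perform the sampling multiple times by changing the seed value (the random_state argument) in the train_test_split() function. In this post, that is what we call bootstrapping: simply put, splitting the data into training and testing randomly “multiple times”. (In statistics, the term usually refers to resampling with replacement; the procedure described here is also known as repeated random sub-sampling.)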

How many times? At least 5, so that you can be sure the testing accuracy you are getting was not just by chance and is similar across all the different samples.

The final accuracy is the average of the accuracies from all sampling iterations.

[Figure: Bootstrapping for ML model testing in Python, overall flow]

You can learn about the different types of sampling in the video below.

In the code below, I will show you how to test a decision tree regressor model using bootstrapping. The same concept applies to any other supervised ML algorithm.
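A minimal sketch of this procedure might look like the following. It assumes synthetic data from make_regression() in place of a real dataset, and R² as the score, since the post does not say which metric it uses. The function name repeated_split_scores and the max_depth setting are illustrative choices, not the author's.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Assumption: synthetic regression data stands in for the real dataset
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=42)

def repeated_split_scores(X, y, n_iterations=5, test_size=0.3):
    """Fit and score a decision tree on several random 70/30 splits."""
    scores = []
    for seed in range(n_iterations):
        # A different seed on each iteration gives a different random split
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=seed
        )
        # max_depth=5 is an arbitrary illustrative setting
        model = DecisionTreeRegressor(max_depth=5, random_state=0)
        model.fit(X_train, y_train)
        # Assumption: R^2 on the held-out rows is used as the score
        scores.append(r2_score(y_test, model.predict(X_test)))
    return scores

scores = repeated_split_scores(X, y)
for i, score in enumerate(scores, start=1):
    print(f"Iteration {i}: R^2 = {score:.3f}")

# The final figure is the average over all sampling iterations
print(f"Average R^2 over {len(scores)} splits: {np.mean(scores):.3f}")
```

If the per-iteration scores are close to one another, the average is a trustworthy summary; a wide spread suggests that any single split could have been misleading.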

Sample Output:

[Figure: sample output of bootstrapping in Python]

In the next post, I will talk about another popular method for testing machine learning models known as k-fold cross-validation.

Author Details
Lead Data Scientist
Farukh is an innovator in solving industry problems using artificial intelligence. His expertise is backed by 10 years of industry experience. As a senior data scientist, he is responsible for designing AI/ML solutions that provide maximum gains for clients. As a thought leader, his focus is on solving the key business problems of the CPG industry. He has worked across different domains such as Telecom, Insurance, and Logistics, and with global tech leaders including Infosys, IBM, and Persistent Systems. His passion for teaching inspired him to create this website!
