K-fold cross validation splits the data into K parts, then iteratively uses one part for testing and the remaining parts as training data.
Into how many parts should you divide the data? Popular choices are:
- K=5: Divide the data into five parts (20% each). Hence, 20% of the data is used for testing and 80% for training in every iteration.
- K=10: Divide the data into ten parts (10% each). Hence, 10% of the data is used for testing and 90% for training in every iteration.
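The fold sizes above can be illustrated with scikit-learn's KFold. A minimal sketch on ten hypothetical rows (the row count and K=5 are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.model_selection import KFold

data = np.arange(10)          # 10 observations
kf = KFold(n_splits=5)        # K=5: five folds of 2 rows each

for fold, (train_idx, test_idx) in enumerate(kf.split(data), start=1):
    # With K=5 and 10 rows: 2 rows (20%) test, 8 rows (80%) train per fold
    print(f"Fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

Each row lands in the test fold exactly once across the five iterations.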
Compared to the bootstrapping approach, which relies on multiple random samples drawn from the full data, K-fold cross-validation is a systematic approach: every observation appears in a test fold exactly once.
The final accuracy is the average accuracy of all iterations.
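The contrast with bootstrapping can be sketched in a few lines (the toy data and seed are arbitrary, purely for illustration):

```python
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(42)
data = np.arange(10)

# Bootstrapping: each round draws a random sample WITH replacement,
# so some rows may repeat while others never get tested
boot_sample = rng.choice(data, size=len(data), replace=True)
print("One bootstrap sample:", sorted(boot_sample.tolist()))

# K-fold: systematic - every row lands in a test fold exactly once
test_rows = np.concatenate([test for _, test in KFold(n_splits=5).split(data)])
print("Rows tested across all folds:", sorted(test_rows.tolist()))
```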

In the below code snippet, I show you how to perform K-fold cross-validation on a Decision Tree regressor. The same approach applies to all other algorithms.
```python
## Creating sample data ##
import pandas as pd
import numpy as np

ColumnNames = ['Hours', 'Calories', 'Weight']
DataValues = [[1.0, 2500, 95],
              [2.0, 2000, 85],
              [2.5, 1900, 83],
              [3.0, 1850, 81],
              [3.5, 1600, 80],
              [4.0, 1500, 78],
              [5.0, 1500, 77],
              [5.5, 1600, 80],
              [6.0, 1700, 75],
              [6.5, 1500, 70]]

# Create the Data Frame
GymData = pd.DataFrame(data=DataValues, columns=ColumnNames)
GymData.head()

# Separate Target Variable and Predictor Variables
TargetVariable = 'Weight'
Predictors = ['Hours', 'Calories']
X = GymData[Predictors].values
y = GymData[TargetVariable].values

#######################################
####### K-fold cross validation #######

# Defining a custom function to calculate accuracy
# Make sure there are no zeros in the Target variable if you are using MAPE
def Accuracy_Score(orig, pred):
    MAPE = np.mean(100 * (np.abs(orig - pred) / orig))
    return 100 - MAPE

# Custom Scoring MAPE calculation
from sklearn.metrics import make_scorer
custom_Scoring = make_scorer(Accuracy_Score, greater_is_better=True)

# Importing cross validation function from sklearn
from sklearn.model_selection import cross_val_score

###### Single Decision Tree Regression in Python ######
from sklearn import tree

# Choose from different tunable hyper parameters
# (note: the criterion 'mse' was renamed to 'squared_error' in
#  recent scikit-learn versions)
RegModel = tree.DecisionTreeRegressor(max_depth=3, criterion='squared_error')

# Running 10-Fold Cross validation on a given algorithm
# Passing full data X and y because K-fold will split the data
# and automatically choose the train/test rows
Accuracy_Values = cross_val_score(RegModel, X, y, cv=10, scoring=custom_Scoring)
print('\nAccuracy values for 10-fold Cross Validation:\n', Accuracy_Values)
print('\nFinal Average Accuracy of the model:', round(Accuracy_Values.mean(), 2))
```
Sample Output

In the next post, I will discuss another technique that applies when the data depends on time. In that case, time-based systematic sampling is used for testing the models.
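As a preview, scikit-learn ships a splitter for time-ordered data, TimeSeriesSplit, where the training window always precedes the test window. A minimal sketch (the toy data and split count are arbitrary; whether the next post uses this exact class is my assumption):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

data = np.arange(10)              # observations ordered by time
tscv = TimeSeriesSplit(n_splits=3)

for fold, (train_idx, test_idx) in enumerate(tscv.split(data), start=1):
    # Every training row comes strictly before every test row in time
    print(f"Fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```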