K-fold cross validation splits the data into K parts, then iteratively uses one part for testing and the remaining parts as training data.
Into how many parts should you divide the data? Popular choices are:
- K=5: Divide the data into five parts (20% each). Hence, 20% of the data is used for testing and 80% for training in every iteration.
- K=10: Divide the data into ten parts (10% each). Hence, 10% of the data is used for testing and 90% for training in every iteration.
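The fold sizes above can be illustrated with scikit-learn's KFold. A minimal sketch on ten hypothetical rows (the row count and K=5 are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.model_selection import KFold

data = np.arange(10)          # 10 observations
kf = KFold(n_splits=5)        # K=5: five folds of 2 rows each

for fold, (train_idx, test_idx) in enumerate(kf.split(data), start=1):
    # With K=5 and 10 rows: 2 rows (20%) test, 8 rows (80%) train per fold
    print(f"Fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

Each row lands in the test fold exactly once across the five iterations.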
Compared to the bootstrapping approach, which relies on multiple random samples drawn from the full data, K-fold cross-validation is a systematic approach: every observation appears in a test fold exactly once.
The final accuracy is the average accuracy of all iterations.
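The contrast with bootstrapping can be sketched in a few lines (the toy data and seed are arbitrary, purely for illustration):

```python
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(42)
data = np.arange(10)

# Bootstrapping: each round draws a random sample WITH replacement,
# so some rows may repeat while others never get tested
boot_sample = rng.choice(data, size=len(data), replace=True)
print("One bootstrap sample:", sorted(boot_sample.tolist()))

# K-fold: systematic - every row lands in a test fold exactly once
test_rows = np.concatenate([test for _, test in KFold(n_splits=5).split(data)])
print("Rows tested across all folds:", sorted(test_rows.tolist()))
```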

In the below code snippet, I show you how to perform K-fold cross-validation on a Decision Tree regressor. The same approach applies to all other algorithms.
```python
## Creating sample data ##
import pandas as pd
import numpy as np

ColumnNames = ['Hours', 'Calories', 'Weight']
DataValues = [[1.0, 2500, 95],
              [2.0, 2000, 85],
              [2.5, 1900, 83],
              [3.0, 1850, 81],
              [3.5, 1600, 80],
              [4.0, 1500, 78],
              [5.0, 1500, 77],
              [5.5, 1600, 80],
              [6.0, 1700, 75],
              [6.5, 1500, 70]]

# Create the Data Frame
GymData = pd.DataFrame(data=DataValues, columns=ColumnNames)
GymData.head()

# Separate Target Variable and Predictor Variables
TargetVariable = 'Weight'
Predictors = ['Hours', 'Calories']
X = GymData[Predictors].values
y = GymData[TargetVariable].values

#######################################
####### K-fold cross validation #######

# Defining a custom function to calculate accuracy
# Make sure there are no zeros in the Target variable if you are using MAPE
def Accuracy_Score(orig, pred):
    MAPE = np.mean(100 * (np.abs(orig - pred) / orig))
    return 100 - MAPE

# Custom Scoring MAPE calculation
from sklearn.metrics import make_scorer
custom_Scoring = make_scorer(Accuracy_Score, greater_is_better=True)

# Importing cross validation function from sklearn
from sklearn.model_selection import cross_val_score

###### Single Decision Tree Regression in Python ######
from sklearn import tree

# Choose from different tunable hyper parameters
# (note: the criterion 'mse' was renamed to 'squared_error' in
#  recent scikit-learn versions)
RegModel = tree.DecisionTreeRegressor(max_depth=3, criterion='squared_error')

# Running 10-Fold Cross validation on a given algorithm
# Passing full data X and y because K-fold will split the data
# and automatically choose the train/test rows
Accuracy_Values = cross_val_score(RegModel, X, y, cv=10, scoring=custom_Scoring)
print('\nAccuracy values for 10-fold Cross Validation:\n', Accuracy_Values)
print('\nFinal Average Accuracy of the model:', round(Accuracy_Values.mean(), 2))
```
Sample Output

In the next post, I will discuss another technique that applies when the data depends on time. In that case, time-based systematic sampling is used for testing the models.
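As a preview, scikit-learn ships a splitter for time-ordered data, TimeSeriesSplit, where the training window always precedes the test window. A minimal sketch (the toy data and split count are arbitrary; whether the next post uses this exact class is my assumption):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

data = np.arange(10)              # observations ordered by time
tscv = TimeSeriesSplit(n_splits=3)

for fold, (train_idx, test_idx) in enumerate(tscv.split(data), start=1):
    # Every training row comes strictly before every test row in time
    print(f"Fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```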