Hyperparameter tuning is one of the most important steps in machine learning. ML algorithms rarely produce their best accuracy out of the box; you need to tune their hyperparameters to get the most out of them. You can follow any one of the strategies below to find the best parameters.
- Manual Search
- Grid Search CV
- Random Search CV
- Bayesian Optimization
In this post, I will discuss Grid Search CV. The CV stands for cross-validation. Grid Search CV exhaustively tries every combination of the parameter values you supply and chooses the best one.
Consider the example below: if you provide a list of values to try for three hyperparameters, Grid Search CV will try every possible combination. In this case that means 5 x 2 x 2 = 20 combinations of hyperparameters. Adding one more hyperparameter multiplies the number of combinations to try, so the time taken grows very quickly. You must be careful to choose only the most important parameters to tune.
```python
# Parameters to try
Parameter_Trials={'n_estimators':[100,200,300,500,1000],
                  'criterion':['gini','entropy'],
                  'max_depth': [2,3]}
```
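If you want to check in advance how many combinations a grid will generate, sklearn's ParameterGrid can enumerate them for you. A minimal sketch using the dictionary above:

```python
from sklearn.model_selection import ParameterGrid

# Enumerate every combination that GridSearchCV would try
grid = ParameterGrid(Parameter_Trials)
print(len(grid))        # 5 x 2 x 2 = 20
print(list(grid)[:2])   # peek at the first two combinations
```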
How will I know what values to provide for each hyperparameter?
You can check the sample values of each hyperparameter by looking at the online sklearn documentation for the algorithm, or by pressing Shift+Tab in a Jupyter notebook with the cursor on the algorithm's function to view its docstring.
With some experience, you will develop a sense of which values work well for most data, so you can prepare a laundry list of good values to try on each dataset.
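You can also list an algorithm's hyperparameters and their default values programmatically with get_params(); a quick sketch:

```python
from sklearn.ensemble import RandomForestClassifier

# Print every hyperparameter of the algorithm along with its default value
print(RandomForestClassifier().get_params())
```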
In the example below, the GridSearchCV function tries out all the parameter combinations provided, which here turns out to be 20 combinations.
For each combination, GridSearchCV also performs cross-validation. You can specify the number of folds using the parameter ‘cv’.
cv=5 means the data will be divided into 5 parts: one part is used for testing and the other four for training. This is known as K-fold cross-validation, here with K=5. The process is repeated 5 times, using a different part as the test data each time, and the final accuracy is the average of these 5 runs.
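To see K-fold cross-validation on its own, here is a minimal sketch using cross_val_score on a small synthetic dataset (the data and variable names here are purely illustrative and separate from the loan example below):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Illustrative synthetic data: 100 rows, 4 predictors
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=42)

# 5-fold cross-validation: the model is fitted 5 times,
# each time testing on a different fifth of the data
scores = cross_val_score(RandomForestClassifier(n_estimators=100), X_demo, y_demo, cv=5)
print(scores)         # 5 accuracy values, one per fold
print(scores.mean())  # the final reported accuracy is their average
```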
Any value between 5 and 10 is good for cross-validation. Remember, the higher the value, the more computation time it takes, because the model is fitted that many more times.
If you choose cv=5 in the case below, the Random Forest model will be fitted 20 x 5 = 100 times.
Notice that no explicit train/test split of the rows is done here, because GridSearchCV performs the splitting itself based on the ‘cv’ input provided.
n_jobs specifies how many parallel jobs to run; n_jobs=1 means no parallelism, while n_jobs=-1 would use all available CPU cores.
verbose=5 means print the model fitting details; the higher the value, the more details are printed.
```python
###################################################################
#### Create Loan Data for Classification in Python ####
import pandas as pd
import numpy as np

ColumnNames=['CIBIL','AGE', 'SALARY', 'APPROVE_LOAN']
DataValues=[[480, 28, 610000, 'Yes'],
            [480, 42, 140000, 'No'],
            [480, 29, 420000, 'No'],
            [490, 30, 420000, 'No'],
            [500, 27, 420000, 'No'],
            [510, 34, 190000, 'No'],
            [550, 24, 330000, 'Yes'],
            [560, 34, 160000, 'Yes'],
            [560, 25, 300000, 'Yes'],
            [570, 34, 450000, 'Yes'],
            [590, 30, 140000, 'Yes'],
            [600, 33, 600000, 'Yes'],
            [600, 22, 400000, 'Yes'],
            [600, 25, 490000, 'Yes'],
            [610, 32, 120000, 'Yes'],
            [630, 29, 360000, 'Yes'],
            [630, 30, 480000, 'Yes'],
            [660, 29, 460000, 'Yes'],
            [700, 32, 470000, 'Yes'],
            [740, 28, 400000, 'Yes']]

# Create the Data Frame
LoanData=pd.DataFrame(data=DataValues, columns=ColumnNames)
LoanData.head()

# Separate Target Variable and Predictor Variables
TargetVariable='APPROVE_LOAN'
Predictors=['CIBIL','AGE', 'SALARY']

X=LoanData[Predictors].values
y=LoanData[TargetVariable].values

############################################################
# GridSearchCV
from sklearn.model_selection import GridSearchCV

# Random Forest (Bagging of multiple Decision Trees)
from sklearn.ensemble import RandomForestClassifier
RF = RandomForestClassifier()

# Parameters to try
Parameter_Trials={'n_estimators':[100,200,300,500,1000],
                  'criterion':['gini','entropy'],
                  'max_depth': [2,3]}

# Grid search with 5-fold cross-validation
Grid_Search = GridSearchCV(RF, Parameter_Trials, cv=5, n_jobs=1, verbose=5)
GridSearchResults = Grid_Search.fit(X, y)
```
Sample Output:

How to access the best hyperparameters?
The best parameters are stored as “best_params_” on the fitted GridSearchCV object. You can now create the Random Forest model using these best parameters; a sketch of this follows the sample output below.
```python
# Fetching the best hyperparameters
print(GridSearchResults.best_params_)

# Looking at all the parameter combinations tried by GridSearch
GridSearchResults.cv_results_['params']
```
Sample Output:

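As mentioned above, the best parameters can be used to build the final model. Below is a minimal sketch reusing the GridSearchResults, X and y objects from the example above; the names Best_Model, Final_RF and Results_Table are just illustrative:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# The cross-validated accuracy of the best combination
print(GridSearchResults.best_score_)

# Option 1: the best model, already refitted on the full data by GridSearchCV
Best_Model = GridSearchResults.best_estimator_

# Option 2: create a fresh Random Forest using the best hyperparameters found
Final_RF = RandomForestClassifier(**GridSearchResults.best_params_)
Final_RF.fit(X, y)

# Inspect the mean test score of every combination as a readable table
Results_Table = pd.DataFrame(GridSearchResults.cv_results_)
print(Results_Table[['params', 'mean_test_score', 'rank_test_score']])
```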