Hyperparameter tuning is one of the most important steps in machine learning. ML algorithms rarely produce their best accuracy out of the box; you need to tune their hyperparameters to get the most out of them. You can follow any one of the strategies below to find the best parameters.
- Manual Search
- Grid Search CV
- Random Search CV
- Bayesian Optimization
In this post, I will discuss Grid Search CV. The CV stands for cross-validation. Grid Search CV exhaustively tries every combination of the parameter values you supply and chooses the best one.
Consider the example below: if you provide a list of values to try for three hyperparameters, Grid Search CV will try every possible combination. In this case that means 5 x 2 x 2 = 20 combinations of hyperparameters. Adding one more hyperparameter multiplies the number of combinations to try, so the time taken grows very quickly. You must be careful to choose only the most important parameters to tune.
```python
# Parameters to try
Parameter_Trials={'n_estimators':[100,200,300,500,1000],
                  'criterion':['gini','entropy'],
                  'max_depth': [2,3]}
```
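If you want to check in advance how many combinations a grid will generate, sklearn's ParameterGrid can enumerate them for you. A minimal sketch using the dictionary above:

```python
from sklearn.model_selection import ParameterGrid

# Enumerate every combination that GridSearchCV would try
grid = ParameterGrid(Parameter_Trials)
print(len(grid))        # 5 x 2 x 2 = 20
print(list(grid)[:2])   # peek at the first two combinations
```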
How will I know what values to provide for each hyperparameter?
You can check the sample values of each hyperparameter by looking at the online sklearn documentation for the algorithm, or by pressing Shift+Tab in a Jupyter notebook with the cursor on the algorithm's function to view its docstring.
With some experience, you will develop a sense of which values work well for most data, so you can prepare a laundry list of good values to try on each dataset.
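You can also list an algorithm's hyperparameters and their default values programmatically with get_params(); a quick sketch:

```python
from sklearn.ensemble import RandomForestClassifier

# Print every hyperparameter of the algorithm along with its default value
print(RandomForestClassifier().get_params())
```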
In the example below, the GridSearchCV function tries out all the parameter combinations provided, which here turns out to be 20 combinations.
For each combination, GridSearchCV also performs cross-validation. You can specify the number of folds using the parameter ‘cv’.
cv=5 means the data will be divided into 5 parts: one part is used for testing and the other four for training. This is known as K-fold cross-validation, here with K=5. The process is repeated 5 times, using a different part as the test data each time, and the final accuracy is the average of these 5 runs.
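To see K-fold cross-validation on its own, here is a minimal sketch using cross_val_score on a small synthetic dataset (the data and variable names here are purely illustrative and separate from the loan example below):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Illustrative synthetic data: 100 rows, 4 predictors
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=42)

# 5-fold cross-validation: the model is fitted 5 times,
# each time testing on a different fifth of the data
scores = cross_val_score(RandomForestClassifier(n_estimators=100), X_demo, y_demo, cv=5)
print(scores)         # 5 accuracy values, one per fold
print(scores.mean())  # the final reported accuracy is their average
```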
Any value between 5 and 10 is good for cross-validation. Remember, the higher the value, the more computation time it takes, because the model is fitted that many more times.
If you choose cv=5 in the case below, the Random Forest model will be fitted 20 x 5 = 100 times.
Notice that no explicit train/test split of the rows is done here, because GridSearchCV performs the splitting itself based on the ‘cv’ input provided.
n_jobs specifies how many parallel jobs to run; n_jobs=1 means no parallelism, while n_jobs=-1 would use all available CPU cores.
verbose=5 means print the model fitting details; the higher the value, the more details are printed.
```python
###################################################################
#### Create Loan Data for Classification in Python ####
import pandas as pd
import numpy as np

ColumnNames=['CIBIL','AGE', 'SALARY', 'APPROVE_LOAN']
DataValues=[[480, 28, 610000, 'Yes'],
            [480, 42, 140000, 'No'],
            [480, 29, 420000, 'No'],
            [490, 30, 420000, 'No'],
            [500, 27, 420000, 'No'],
            [510, 34, 190000, 'No'],
            [550, 24, 330000, 'Yes'],
            [560, 34, 160000, 'Yes'],
            [560, 25, 300000, 'Yes'],
            [570, 34, 450000, 'Yes'],
            [590, 30, 140000, 'Yes'],
            [600, 33, 600000, 'Yes'],
            [600, 22, 400000, 'Yes'],
            [600, 25, 490000, 'Yes'],
            [610, 32, 120000, 'Yes'],
            [630, 29, 360000, 'Yes'],
            [630, 30, 480000, 'Yes'],
            [660, 29, 460000, 'Yes'],
            [700, 32, 470000, 'Yes'],
            [740, 28, 400000, 'Yes']]

# Create the Data Frame
LoanData=pd.DataFrame(data=DataValues, columns=ColumnNames)
LoanData.head()

# Separate Target Variable and Predictor Variables
TargetVariable='APPROVE_LOAN'
Predictors=['CIBIL','AGE', 'SALARY']

X=LoanData[Predictors].values
y=LoanData[TargetVariable].values

############################################################
# GridSearchCV
from sklearn.model_selection import GridSearchCV

# Random Forest (Bagging of multiple Decision Trees)
from sklearn.ensemble import RandomForestClassifier
RF = RandomForestClassifier()

# Parameters to try
Parameter_Trials={'n_estimators':[100,200,300,500,1000],
                  'criterion':['gini','entropy'],
                  'max_depth': [2,3]}

# Grid search with 5-fold cross-validation
Grid_Search = GridSearchCV(RF, Parameter_Trials, cv=5, n_jobs=1, verbose=5)
GridSearchResults = Grid_Search.fit(X, y)
```
Sample Output:

How to access the best hyperparameters?
The best parameters are stored as “best_params_” on the fitted GridSearchCV object. You can now create the Random Forest model using these best parameters; a sketch of this follows the sample output below.
```python
# Fetching the best hyperparameters
print(GridSearchResults.best_params_)

# Looking at all the parameter combinations tried by GridSearch
GridSearchResults.cv_results_['params']
```
Sample Output:

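As mentioned above, the best parameters can be used to build the final model. Below is a minimal sketch reusing the GridSearchResults, X and y objects from the example above; the names Best_Model, Final_RF and Results_Table are just illustrative:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# The cross-validated accuracy of the best combination
print(GridSearchResults.best_score_)

# Option 1: the best model, already refitted on the full data by GridSearchCV
Best_Model = GridSearchResults.best_estimator_

# Option 2: create a fresh Random Forest using the best hyperparameters found
Final_RF = RandomForestClassifier(**GridSearchResults.best_params_)
Final_RF.fit(X, y)

# Inspect the mean test score of every combination as a readable table
Results_Table = pd.DataFrame(GridSearchResults.cv_results_)
print(Results_Table[['params', 'mean_test_score', 'rank_test_score']])
```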