Hyperparameter tuning is one of the most important steps in machine learning, because ML algorithms will not produce the highest accuracy out of the box. You need to tune their hyperparameters to achieve the best accuracy. You can follow any one of the strategies below to find the best parameters.
- Manual Search
- Grid Search CV
- Random Search CV
- Bayesian Optimization
In this post, I will discuss Random Search CV. The CV stands for cross-validation.
What is the difference between GridSearchCV and RandomizedSearchCV?
The main difference between these two techniques is the obligation to try all parameter combinations: GridSearchCV has to try ALL of them, whereas RandomizedSearchCV tries only a few ‘random’ combinations out of all the available ones.
For example, with the parameter options below, GridSearchCV will try all 20 combinations (5 values of n_estimators × 2 of criterion × 2 of max_depth = 20), whereas for RandomizedSearchCV you can specify how many of them to try by passing a parameter called “n_iter”. If you keep n_iter=5, any 5 random combinations will be tried.
```python
# Parameters to try
Parameter_Trials = {'n_estimators': [100, 200, 300, 500, 1000],
                    'criterion': ['gini', 'entropy'],
                    'max_depth': [2, 3]}
```
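If you want to verify that combination count yourself, scikit-learn's ParameterGrid can enumerate the full grid that GridSearchCV would have to search. A minimal sketch, reusing the same parameter dictionary:

```python
from sklearn.model_selection import ParameterGrid

# Count every combination GridSearchCV would have to try
Parameter_Trials = {'n_estimators': [100, 200, 300, 500, 1000],
                    'criterion': ['gini', 'entropy'],
                    'max_depth': [2, 3]}
print(len(list(ParameterGrid(Parameter_Trials))))  # 5 * 2 * 2 = 20
```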

In the code below, RandomizedSearchCV will try 5 randomly chosen combinations of hyperparameters.
We have specified cv=5, which means each combination will be tested (cross-validated) 5 times: the data is divided into 5 parts, one part is used as testing data and the other four as training data, and this is repeated so that every part serves as the test set once. The final accuracy for each combination of hyperparameters is the average of these five iterations.
Hence the total number of times the model will be fitted is n_iter × cv = 5 × 5 = 25 times!
n_jobs=1 specifies the number of parallel jobs to run, and verbose=5 controls how much detail is printed while fitting the model; the higher the value, the more details are printed.
```python
###################################################################
#### Create Loan Data for Classification in Python ####

import pandas as pd
import numpy as np

ColumnNames = ['CIBIL', 'AGE', 'SALARY', 'APPROVE_LOAN']
DataValues = [[480, 28, 610000, 'Yes'],
              [480, 42, 140000, 'No'],
              [480, 29, 420000, 'No'],
              [490, 30, 420000, 'No'],
              [500, 27, 420000, 'No'],
              [510, 34, 190000, 'No'],
              [550, 24, 330000, 'Yes'],
              [560, 34, 160000, 'Yes'],
              [560, 25, 300000, 'Yes'],
              [570, 34, 450000, 'Yes'],
              [590, 30, 140000, 'Yes'],
              [600, 33, 600000, 'Yes'],
              [600, 22, 400000, 'Yes'],
              [600, 25, 490000, 'Yes'],
              [610, 32, 120000, 'Yes'],
              [630, 29, 360000, 'Yes'],
              [630, 30, 480000, 'Yes'],
              [660, 29, 460000, 'Yes'],
              [700, 32, 470000, 'Yes'],
              [740, 28, 400000, 'Yes']]

# Create the Data Frame
LoanData = pd.DataFrame(data=DataValues, columns=ColumnNames)
LoanData.head()

# Separate Target Variable and Predictor Variables
TargetVariable = 'APPROVE_LOAN'
Predictors = ['CIBIL', 'AGE', 'SALARY']
X = LoanData[Predictors].values
y = LoanData[TargetVariable].values

############################################################
# Random Search CV
from sklearn.model_selection import RandomizedSearchCV

# Random Forest (Bagging of multiple Decision Trees)
from sklearn.ensemble import RandomForestClassifier
RF = RandomForestClassifier()

# Parameters to try
Parameter_Trials = {'n_estimators': [100, 200, 300, 500, 1000],
                    'criterion': ['gini', 'entropy'],
                    'max_depth': [2, 3]}

Random_Search = RandomizedSearchCV(RF, Parameter_Trials, n_iter=5,
                                   cv=5, n_jobs=1, verbose=5)
RandomSearchResults = Random_Search.fit(X, y)
```
Sample Output
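To see what cv=5 does for a single hyperparameter combination, here is a minimal sketch using scikit-learn's cross_val_score, assuming the X and y arrays from the code above. The specific parameter values are just an illustration:

```python
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

# One hyperparameter combination, cross-validated 5 times
RF_single = RandomForestClassifier(n_estimators=100, criterion='gini', max_depth=2)
scores = cross_val_score(RF_single, X, y, cv=5)

# The average of the 5 fold scores is what the search records for this combination
print(scores.mean())
```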

How to access the best hyperparameters?
The best combination of hyperparameters is stored in the “best_params_” attribute of the search results.
```python
# Fetching the best hyperparameters
RandomSearchResults.best_params_

# All the parameter combinations tried by RandomizedSearchCV
RandomSearchResults.cv_results_['params']
```
Sample Output
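Once you have the best hyperparameters, a common next step is to train a final model with them. A minimal sketch, assuming the RandomSearchResults object and the X and y arrays from the code above:

```python
from sklearn.ensemble import RandomForestClassifier

# Refit a final model using the best hyperparameters found by the search
Best_RF = RandomForestClassifier(**RandomSearchResults.best_params_)
Best_RF.fit(X, y)

# RandomizedSearchCV also refits the best combination on the full data by
# default (refit=True), so the fitted winner is available directly as well
Best_RF = RandomSearchResults.best_estimator_
```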
