Hyperparameter tuning is one of the most important steps in machine learning, because ML algorithms will not produce the highest accuracy out of the box. You need to tune their hyperparameters to achieve the best accuracy. You can follow any one of the strategies below to find the best parameters:
- Manual Search
- Grid Search CV
- Random Search CV
- Bayesian Optimization
In this post, I discuss Manual Search parameter tuning.
Manual Search is an ad-hoc approach to finding good hyperparameter values for any machine learning algorithm. The idea is to first take big jumps between values, and then small jumps to narrow in on a specific value that performed better.
For example, in the Random Forest algorithm, n_estimators is the number of trees to grow. You can find a good value for this parameter by starting with big jumps like 100, 200, 500, and 1000. Once you know which of them gave the best accuracy, try values around it: if the accuracy was best around 500, keep trying nearby values like 480, 490, 500, 510, 520, etc., and choose whichever gives the highest accuracy, as sketched below.
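Here is a minimal sketch of that coarse-to-fine loop. It assumes a train/test split (X_train, X_test, y_train, y_test) like the one created in the full example later in this post, and the helper function score_n_estimators is just an illustrative name:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

def score_n_estimators(n):
    # Train a Random Forest with n trees and return the
    # weighted F1-score on the testing data
    clf = RandomForestClassifier(n_estimators=n, random_state=42)
    clf.fit(X_train, y_train)
    return f1_score(y_test, clf.predict(X_test), average='weighted')

# Pass 1: big jumps to locate a promising region
for n in [100, 200, 500, 1000]:
    print(n, round(score_n_estimators(n), 3))

# Pass 2: small jumps around the best value from pass 1 (say, 500)
for n in [480, 490, 500, 510, 520]:
    print(n, round(score_n_estimators(n), 3))
```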
How will I know what values to try?
You can look at the sample values for each parameter in the function documentation (press Shift+Tab after clicking on the function in a Jupyter notebook) or in the online sklearn documentation.
Mostly, with some experience, you get a feel for parameter values that work well for most data.
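You can also list every tunable hyperparameter of an estimator, along with its default value, directly in code. This short snippet uses sklearn's get_params() method; the printed dictionary shown in the comment is abbreviated:

```python
from sklearn.ensemble import RandomForestClassifier

# Print all hyperparameters and their default values
print(RandomForestClassifier().get_params())
# e.g. {'bootstrap': True, 'criterion': 'gini', 'max_depth': None,
#       'n_estimators': 100, ...}
```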
In the example below, you can try different values of n_estimators, criterion, and max_depth to get the highest accuracy. Look at the weighted F1-score in the output, and choose as final whichever combination of parameters makes it highest.
```python
###################################################################
#### Create Loan Data for Classification in Python ####
import pandas as pd
import numpy as np

ColumnNames = ['CIBIL', 'AGE', 'SALARY', 'APPROVE_LOAN']
DataValues = [[480, 28, 610000, 'Yes'],
              [480, 42, 140000, 'No'],
              [480, 29, 420000, 'No'],
              [490, 30, 420000, 'No'],
              [500, 27, 420000, 'No'],
              [510, 34, 190000, 'No'],
              [550, 24, 330000, 'Yes'],
              [560, 34, 160000, 'Yes'],
              [560, 25, 300000, 'Yes'],
              [570, 34, 450000, 'Yes'],
              [590, 30, 140000, 'Yes'],
              [600, 33, 600000, 'Yes'],
              [600, 22, 400000, 'Yes'],
              [600, 25, 490000, 'Yes'],
              [610, 32, 120000, 'Yes'],
              [630, 29, 360000, 'Yes'],
              [630, 30, 480000, 'Yes'],
              [660, 29, 460000, 'Yes'],
              [700, 32, 470000, 'Yes'],
              [740, 28, 400000, 'Yes']]

# Create the Data Frame
LoanData = pd.DataFrame(data=DataValues, columns=ColumnNames)
LoanData.head()

# Separate Target Variable and Predictor Variables
TargetVariable = 'APPROVE_LOAN'
Predictors = ['CIBIL', 'AGE', 'SALARY']
X = LoanData[Predictors].values
y = LoanData[TargetVariable].values

# Split the data into training and testing set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

###################################################################
#### Trying out different hyperparameters for Random Forest ####
from sklearn.ensemble import RandomForestClassifier

# Change values of n_estimators = 100, 200, 300, 400 etc.
# Change values of criterion = 'gini', 'entropy' etc.
# Change values of max_depth as 3, 4, 5 etc.
clf = RandomForestClassifier(max_depth=3, n_estimators=100, criterion='gini')

# Printing all the parameters of Random Forest
print(clf)

# Creating the model on Training Data
RF = clf.fit(X_train, y_train)
prediction = RF.predict(X_test)

# Measuring accuracy on Testing Data
from sklearn import metrics
print(metrics.classification_report(y_test, prediction))
print(metrics.confusion_matrix(y_test, prediction))

# Plotting the feature importance for Top 10 most important columns
%matplotlib inline
feature_importances = pd.Series(RF.feature_importances_, index=Predictors)
feature_importances.nlargest(10).plot(kind='barh')

# Printing some sample values of prediction
TestingDataResults = pd.DataFrame(data=X_test, columns=Predictors)
TestingDataResults['TargetColumn'] = y_test
TestingDataResults['Prediction'] = prediction
TestingDataResults.head()
```
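If you want to automate these manual trials, a small loop can print the weighted F1-score for each parameter combination and keep track of the best one. This is only a sketch, assuming the train/test split from the example above is in scope; the parameter grid here is illustrative, not exhaustive:

```python
from itertools import product
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

best_score, best_params = 0.0, None
for n, crit, depth in product([100, 200, 300], ['gini', 'entropy'], [3, 4, 5]):
    clf = RandomForestClassifier(n_estimators=n, criterion=crit,
                                 max_depth=depth, random_state=42)
    clf.fit(X_train, y_train)
    # Weighted F1-score on the testing data for this combination
    score = f1_score(y_test, clf.predict(X_test), average='weighted')
    print(n, crit, depth, round(score, 3))
    if score > best_score:
        best_score, best_params = score, (n, crit, depth)

print('Best weighted F1:', round(best_score, 3), 'with', best_params)
```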
Sample Output:
