Data Science Interview Questions for IT Industry Part-3: Supervised ML

In the previous post, I discussed the major types of machine learning. In this post, I will discuss the popular supervised machine learning algorithms which are asked in the data science interviews.

I will also share the code to implement each one them in R/Python.

Overview
Sample Datasets
Linear Regression
Logistic Regression
Decision Tree
Random Forest
Adaboost
XGBoost
KNN
SVM

Overview

In most of the IT projects, you will be able to find a use case for Supervised Machine learning. Because there will always be a need to predict a future number like sales, turnover, number of support tickets or the demand for a product, etc. This is when you will perform Regression.

Similarly, there will always be a need to classify records. e.g. If a home loan should be issued or not? What is the value of a customer (Silver, Gold, Platinum), is this insurance claim fraud or not? etc. This is when you will perform Classification

Supervised Machine Learning algorithms can be grouped into two categories (Regression and Classification) listed below. There are many statistical algorithms which are used for Supervised Machine Learning. I am listing some of the popular ones here which can be used to solve any given Regression or Classification problem with good accuracy.

Regression(Predicting a continuous number): Linear Regression, Decision Trees, Random Forests, XGboost, Adaboost.

Classification(Predicting a class): Logistic Regression, Decision Trees, Random Forests, XGboost Adaboost, KNN, SVM.

Regression and Classification are two types of Supervised Machine Learning algorithms. There are various algorithms for each of them like Linear Regression, Logistic Regression, Decision Trees, Random Forests, XGboost, Adaboost. — Popular algorithms used for
Supervised Machine Learning

As you can see some of the algorithms are suitable for regression as well as classification. The final choice is made by measuring the accuracy of the model.

When you are asked about any of the algorithms, try to use below flow while describing it.

What is the core concept of the algorithm (Overview)?
The Mathematical logic behind the algorithm.
In which language(R/Python) you have implemented this algorithm
What are the hyperparameters you tuned and the effect of these parameters?
How do you measure its accuracy?
Which type of use cases fit this algorithm/Why it was selected for your project.

Sample Datasets

I have used two simple datasets for regression and classification algorithms. I have written the code below to generate these in R and Python.

Regression: This small dataset contains the weight and number of hours samples for random people who spend time at the Gym along with how many calories they consume in a day.
Target variable: Weight
Predictors: Hours, Calories

Predict Weight of a person based on Hours and Calories consumed

Classification: This small dataset contains the loan approval history for various applicants based on their AGE, SALARY and CIBIL score.
Target variable: APPROVE_LOAN
Predictors: CIBIL, AGE, SALARY

Predict if a loan should be approved or not based on AGE, CIBIL, and SALARY

Create sample datasets in R

#### Create Gym Data for regression in R ####
# Generating Data Vectors
Hours=c(1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 5.0, 5.5, 6.0, 6.5)
Calories=c(2500, 2000, 1900, 1850, 1600, 1500, 1500, 1600, 1700, 1500)
Weight=c(95, 85, 83, 81, 80, 78, 77, 80, 75, 70)
# Creating the sample dataframe

GymData=data.frame(Hours, Calories, Weight)
print(GymData)

# Sampling | Splitting data into 70% for training 30% for testing
TrainingSampleIndex=sample(1:nrow(GymData), size=0.7 * nrow(GymData) )
DataForMLTrain=GymData[TrainingSampleIndex, ]
DataForMLTest=GymData[-TrainingSampleIndex, ]
dim(DataForMLTrain)
dim(DataForMLTest) 

#######################################################
#### Create Loan Data for classification in R ####
# Generating Data Vectors
CIBIL=c(480,480,480,490,500,510,550,560,560,570,590,600,600,600,610,630,630,660,700,740)
AGE=c(28,42,29,30,27,34,24,34,25,34,30,33,22,25,32,29,30,29,32,28)
SALARY=c(610000,140000,420000,420000,420000,190000,330000,160000,300000,&nbsp;
450000,140000,600000,400000,490000,120000,360000,480000,460000,470000,400000)
APPROVE_LOAN=c("Yes","No","No","No","No","No","Yes","Yes","Yes","Yes","Yes",
"Yes","Yes","Yes","Yes","Yes","Yes","Yes","Yes","Yes")
 
LoanData=data.frame(CIBIL,AGE,SALARY,APPROVE_LOAN)
print(LoanData)

# Sampling | Splitting data into 70% for training 30% for testing
TrainingSampleIndex=sample(1:nrow(LoanData), size=0.7 * nrow(LoanData) )
DataForMLTrain=LoanData[TrainingSampleIndex, ]
DataForMLTest=LoanData[-TrainingSampleIndex, ]
dim(DataForMLTrain)
dim(DataForMLTest)

#### Create Gym Data for regression in R ####

# Generating Data Vectors

Hours=c(1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 5.0, 5.5, 6.0, 6.5)

Calories=c(2500, 2000, 1900, 1850, 1600, 1500, 1500, 1600, 1700, 1500)

Weight=c(95, 85, 83, 81, 80, 78, 77, 80, 75, 70)

# Creating the sample dataframe

GymData=data.frame(Hours, Calories, Weight)

print(GymData)

# Sampling | Splitting data into 70% for training 30% for testing

TrainingSampleIndex=sample(1:nrow(GymData), size=0.7 * nrow(GymData) )

DataForMLTrain=GymData[TrainingSampleIndex, ]

DataForMLTest=GymData[-TrainingSampleIndex, ]

dim(DataForMLTrain)

dim(DataForMLTest)

#######################################################

#### Create Loan Data for classification in R ####

# Generating Data Vectors

CIBIL=c(480,480,480,490,500,510,550,560,560,570,590,600,600,600,610,630,630,660,700,740)

AGE=c(28,42,29,30,27,34,24,34,25,34,30,33,22,25,32,29,30,29,32,28)

SALARY=c(610000,140000,420000,420000,420000,190000,330000,160000,300000, 

450000,140000,600000,400000,490000,120000,360000,480000,460000,470000,400000)

APPROVE_LOAN=c("Yes","No","No","No","No","No","Yes","Yes","Yes","Yes","Yes",

"Yes","Yes","Yes","Yes","Yes","Yes","Yes","Yes","Yes")

LoanData=data.frame(CIBIL,AGE,SALARY,APPROVE_LOAN)

print(LoanData)

# Sampling | Splitting data into 70% for training 30% for testing

TrainingSampleIndex=sample(1:nrow(LoanData), size=0.7 * nrow(LoanData) )

DataForMLTrain=LoanData[TrainingSampleIndex, ]

DataForMLTest=LoanData[-TrainingSampleIndex, ]

dim(DataForMLTrain)

dim(DataForMLTest)

Create sample datasets in Python

#### Create Gym Data for regression in Python ####
import pandas as pd
import numpy as np
ColumnNames=['Hours','Calories', 'Weight']
DataValues=[[  1.0,   2500,   95],
             [  2.0,   2000,   85],
             [  2.5,   1900,   83],
             [  3.0,   1850,   81],
             [  3.5,   1600,   80],
             [  4.0,   1500,   78],
             [  5.0,   1500,   77],
             [  5.5,   1600,   80],
             [  6.0,   1700,   75],
             [  6.5,   1500,   70]]
#Create the Data Frame
GymData=pd.DataFrame(data=DataValues,columns=ColumnNames)
GymData.head()

#Separate Target Variable and Predictor Variables
TargetVariable='Weight'
Predictors=['Hours','Calories']
X=GymData[Predictors].values
y=GymData[TargetVariable].values

#Split the data into training and testing set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

###################################################################
#### Create Loan Data for Classification in Python ####
import pandas as pd
import numpy as np
ColumnNames=['CIBIL','AGE', 'SALARY', 'APPROVE_LOAN']
DataValues=[[480, 28, 610000, 'Yes'],
             [480, 42, 140000, 'No'],
             [480, 29, 420000, 'No'],
             [490, 30, 420000, 'No'],
             [500, 27, 420000, 'No'],
             [510, 34, 190000, 'No'],
             [550, 24, 330000, 'Yes'],
             [560, 34, 160000, 'Yes'],
             [560, 25, 300000, 'Yes'],
             [570, 34, 450000, 'Yes'],
             [590, 30, 140000, 'Yes'],
             [600, 33, 600000, 'Yes'],
             [600, 22, 400000, 'Yes'],
             [600, 25, 490000, 'Yes'],
             [610, 32, 120000, 'Yes'],
             [630, 29, 360000, 'Yes'],
             [630, 30, 480000, 'Yes'],
             [660, 29, 460000, 'Yes'],
             [700, 32, 470000, 'Yes'],
             [740, 28, 400000, 'Yes']]

#Create the Data Frame
LoanData=pd.DataFrame(data=DataValues,columns=ColumnNames)
LoanData.head()

#Separate Target Variable and Predictor Variables
TargetVariable='APPROVE_LOAN'
Predictors=['CIBIL','AGE', 'SALARY']
X=LoanData[Predictors].values
y=LoanData[TargetVariable].values

#Split the data into training and testing set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#### Create Gym Data for regression in Python ####

import pandas as pd

import numpy as np

ColumnNames=['Hours','Calories', 'Weight']

DataValues=[[ 1.0, 2500, 95],

[ 2.0, 2000, 85],

[ 2.5, 1900, 83],

[ 3.0, 1850, 81],

[ 3.5, 1600, 80],

[ 4.0, 1500, 78],

[ 5.0, 1500, 77],

[ 5.5, 1600, 80],

[ 6.0, 1700, 75],

[ 6.5, 1500, 70]]

#Create the Data Frame

GymData=pd.DataFrame(data=DataValues,columns=ColumnNames)

GymData.head()

#Separate Target Variable and Predictor Variables

TargetVariable='Weight'

Predictors=['Hours','Calories']

X=GymData[Predictors].values

y=GymData[TargetVariable].values

#Split the data into training and testing set

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

###################################################################

#### Create Loan Data for Classification in Python ####

import pandas as pd

import numpy as np

ColumnNames=['CIBIL','AGE', 'SALARY', 'APPROVE_LOAN']

DataValues=[[480, 28, 610000, 'Yes'],

[480, 42, 140000, 'No'],

[480, 29, 420000, 'No'],

[490, 30, 420000, 'No'],

[500, 27, 420000, 'No'],

[510, 34, 190000, 'No'],

[550, 24, 330000, 'Yes'],

[560, 34, 160000, 'Yes'],

[560, 25, 300000, 'Yes'],

[570, 34, 450000, 'Yes'],

[590, 30, 140000, 'Yes'],

[600, 33, 600000, 'Yes'],

[600, 22, 400000, 'Yes'],

[600, 25, 490000, 'Yes'],

[610, 32, 120000, 'Yes'],

[630, 29, 360000, 'Yes'],

[630, 30, 480000, 'Yes'],

[660, 29, 460000, 'Yes'],

[700, 32, 470000, 'Yes'],

[740, 28, 400000, 'Yes']]

#Create the Data Frame

LoanData=pd.DataFrame(data=DataValues,columns=ColumnNames)

LoanData.head()

#Separate Target Variable and Predictor Variables

TargetVariable='APPROVE_LOAN'

Predictors=['CIBIL','AGE', 'SALARY']

X=LoanData[Predictors].values

y=LoanData[TargetVariable].values

#Split the data into training and testing set

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Let’s get started with the questions!

Q. Explain how the Linear Regression algorithm works?

How to perform Linear Regression in R and Python — Linear Regression in Machine Learning

Linear regression algorithm is based on the equation of a line (y= m * X + C).

The assumption is that the variables are related in a linear way.

Simple Linear Regression.
Only one independent variable.

In Machine Learning, the target variable is ‘y’ and the predictor is ‘X’.

The linear regression algorithm helps to find the best values of the coefficients ‘m’ and ‘C’ for given data using a cost function. One of the cost functions is known as the Sum of Squared Errors(SSE). There are many others like Root Mean Squared Error(RMSE), Mean of Squared Errors (MAE), log loss or Cross-Entropy loss, etc.

Example: Predict the weight of a person based on the number of hours spent at the gym and the amount of calories intake every day.

Target Variable: Weight

Predictors: Hours, Calories

Simple Linear Regression Equation: Weight = m* (Hours) + C

Multiple Linear Regression Equation: Weight = m1 * (Hours) + m2 * (Calories)+ C

For multiple linear regression, the algorithm finds the best values of the coefficients m1, m2, and C.

If there is one predictor then its called simple linear regression, if there are more than one predictors then it is called multiple linear regression.

Normal linear regression(Ordinary Least Square Regression) does not have any hyperparameters to tune.

If you are using regularization, then either Ridge or Lasso is used.

Regularization means reducing the effect of a variable in the equation by multiplying it by a very small coefficient. Hence, penalize that variable. This penalty can be given in two ways, first is Ridge and second is LASSO.

Ridge regression improves prediction error by shrinking large regression coefficients in order to reduce overfitting, but it does not perform variable selection and therefore does not help to make the model more interpretable. This is where LASSO comes into the picture.

Lasso(Least Absolute Shrinkage and Selection Operator) is able to improve the prediction accuracy and perform the variable selection by forcing certain coefficients(m1, m2, etc) to be set to zero, hence, choosing a simpler model that does not include those predictors.

Accuracy Measurement is done using Mean Absolute Percent Error Or Root Mean Squared Error.

Accuracy=100- MAPE or 100-RMSE.

Q. How to do linear regression in R?

###### Linear Regression in R #######
Model_Reg=lm(formula= Weight ~ Hours+Calories,data=DataForMLTrain)

# Measuring Goodness of Fit using R2 value on TRAINING DATA
Orig=DataForMLTrain$Weight
Pred=predict(Model_Reg, DataForMLTrain)

R2= 1 - (sum((Orig-Pred)^2)/sum((Orig-mean(Orig))^2))
print(paste('R2 Value is:',round(R2,2)))

# Predictions of model on Testing data
DataForMLTest$Prediction=predict(Model_Reg, DataForMLTest)
head(DataForMLTest)

# Calculating the Absolute Percentage Error for each prediction in TESTING DATA
LM_APE= 100 *(abs(DataForMLTest$Prediction - DataForMLTest$Weight)/DataForMLTest$Weight)
print(paste('### Mean Accuracy of Linear Regression Model is: ', 100 - mean(LM_APE)))
print(paste('### Median Accuracy of Linear Regression Model is: ', 100 - median(LM_APE)))

###### Linear Regression in R #######

Model_Reg=lm(formula= Weight ~ Hours+Calories,data=DataForMLTrain)

# Measuring Goodness of Fit using R2 value on TRAINING DATA

Orig=DataForMLTrain$Weight

Pred=predict(Model_Reg, DataForMLTrain)

R2= 1 - (sum((Orig-Pred)^2)/sum((Orig-mean(Orig))^2))

print(paste('R2 Value is:',round(R2,2)))

# Predictions of model on Testing data

DataForMLTest$Prediction=predict(Model_Reg, DataForMLTest)

head(DataForMLTest)

# Calculating the Absolute Percentage Error for each prediction in TESTING DATA

LM_APE= 100 *(abs(DataForMLTest$Prediction - DataForMLTest$Weight)/DataForMLTest$Weight)

print(paste('### Mean Accuracy of Linear Regression Model is: ', 100 - mean(LM_APE)))

print(paste('### Median Accuracy of Linear Regression Model is: ', 100 - median(LM_APE)))

Q. How to do linear regression in Python?

###### Linear Regression in Python #######
import pandas as pd
from sklearn.linear_model import LinearRegression
RegModel = LinearRegression()

#Printing all the parameters of Linear regression
print(RegModel)

#Creating the model on Training Data
LREG=RegModel.fit(X_train,y_train)
prediction=LREG.predict(X_test)
from sklearn import metrics

#Measuring Goodness of fit in Training data
print('R2 Value:',metrics.r2_score(y_train, LREG.predict(X_train)))

#Measuring accuracy on Testing Data
print('Accuracy',100- (np.mean(np.abs((y_test - prediction) / y_test)) * 100))

#Printing some sample values of prediction
TestingDataResults=pd.DataFrame(data=X_test, columns=Predictors)
TestingDataResults[TargetVariable]=y_test
TestingDataResults[('Predicted'+TargetVariable)]=prediction
TestingDataResults.head()

###### Linear Regression in Python #######

import pandas as pd

from sklearn.linear_model import LinearRegression

RegModel = LinearRegression()

#Printing all the parameters of Linear regression

print(RegModel)

#Creating the model on Training Data

LREG=RegModel.fit(X_train,y_train)

prediction=LREG.predict(X_test)

from sklearn import metrics

#Measuring Goodness of fit in Training data

print('R2 Value:',metrics.r2_score(y_train, LREG.predict(X_train)))

#Measuring accuracy on Testing Data

print('Accuracy',100- (np.mean(np.abs((y_test - prediction) / y_test)) * 100))

#Printing some sample values of prediction

TestingDataResults=pd.DataFrame(data=X_test, columns=Predictors)

TestingDataResults[TargetVariable]=y_test

TestingDataResults[('Predicted'+TargetVariable)]=prediction

TestingDataResults.head()

Q. Explain how Logistic Regression algorithm works?

What is the syntax for Logistic regression in R and Python — Logistic Regression in Machine Learning

Logistic Regression is used for predicting a category, specially the Binary categories(Yes/No , 0/1).

For example, whether to approve a loan or not (Yes/No)? Which group does this customer belong to (Silver/Gold/Platinum)? etc.

When there are only two outcomes in Target Variable it is known as Binomial Logistic Regression.

If there are more than two outcomes in Target Variable it is known as Multinomial Logistic Regression.

If the outcomes in Target Variable are ordinal and there is a natural ordering in the values (eg. Small<Medium<Large) then it is known as Ordinal Logistic Regression.

Logistic Regression outputs the probability of an event

Logistic regression is based on logit function logit(x) = log(x / (1 – x))

The output is a value between 0 to 1. It is the probability of an event’s occurrence.

E.g. There is an 80% chance that the loan application is good, approve it.

The coefficients β0, β1, β2, β3… are found using Maximum Likelihood Estimation Technique. Basically, if the Target Variable’s value (y) is 1, then the probability of one “P(1)” should be as close to 1 as possible and the probability of zero “P(0)” should be as close to 0 as possible. Find those coefficients which satisfy both the conditions.

The Goodness of Fit (AIC and Deviance ):

These measures help to understand how good the model fits the data. Please note This is NOT the Accuracy of the model.

AIC The Akaike Information Criterion (AIC)
provides a method for assessing the quality
of your model through comparison of
related models. The number itself is not
meaningful. If you have more than one
model then you
should select the model that has the
smallest AIC.

The null deviance shows how well the
Target variable is predicted by a model that
includes only the intercept (grand mean).

The Residual deviance shows how well the target variable is predicted by a model that includes multiple independent predictors

Accuracy Measurement is done using f1-Score/Precision/Recall/ROC/AUC.

Q. How to do Logistic Regression in R?

###### Logistic Regression in R #######
# Creating the model on TRAINING data
Model_LR=glm(APPROVE_LOAN~CIBIL+AGE+SALARY, data=DataForMLTrain, family='binomial')

# Checking Accuracy of model on Testing data
DataForMLTest$Prediction=predict(Model_LR, DataForMLTest)
head(DataForMLTest)

#Creating the Confusion Matrix to calculate overall accuracy, precision and recall on TESTING data

library(caret)
AccuracyResults=confusionMatrix(DataForMLTest$Prediction, DataForMLTest$APPROVE_LOAN, mode = "prec_recall")

#Since AccuracyResults is a list of multiple items, fetching useful components only
AccuracyResults[['table']]
AccuracyResults[['byClass']]
print(paste('### Overall Accuracy of Logistic Regression Model is: ', round(100 * AccuracyResults[['overall']][1]) , '%'))

###### Logistic Regression in R #######

# Creating the model on TRAINING data

Model_LR=glm(APPROVE_LOAN~CIBIL+AGE+SALARY, data=DataForMLTrain, family='binomial')

# Checking Accuracy of model on Testing data

DataForMLTest$Prediction=predict(Model_LR, DataForMLTest)

head(DataForMLTest)

#Creating the Confusion Matrix to calculate overall accuracy, precision and recall on TESTING data

library(caret)

AccuracyResults=confusionMatrix(DataForMLTest$Prediction, DataForMLTest$APPROVE_LOAN, mode = "prec_recall")

#Since AccuracyResults is a list of multiple items, fetching useful components only

AccuracyResults[['table']]

AccuracyResults[['byClass']]

print(paste('### Overall Accuracy of Logistic Regression Model is: ', round(100 * AccuracyResults[['overall']][1]) , '%'))

Q. How to do Logistic Regression in Python?

###### Logistic Regression in Python #######
import pandas as pd
from sklearn.linear_model import LogisticRegression
#choose parameter Penalty='l1' or C=1
clf = LogisticRegression(C=1,penalty='l1')

#Printing all the parameters of logistic regression
print(clf)

#Creating the model on Training Data
LOG=clf.fit(X_train,y_train)
prediction=LOG.predict(X_test)

#Measuring accuracy on Testing Data
from sklearn import metrics
print(metrics.classification_report(y_test, prediction))
print(metrics.confusion_matrix(y_test, prediction))

#Printing some sample values of prediction
TestingDataResults=pd.DataFrame(data=X_test, columns=Predictors)
TestingDataResults['TargetColumn']=y_test
TestingDataResults['Prediction']=prediction
TestingDataResults.head()

###### Logistic Regression in Python #######

import pandas as pd

from sklearn.linear_model import LogisticRegression

#choose parameter Penalty='l1' or C=1

clf = LogisticRegression(C=1,penalty='l1')

#Printing all the parameters of logistic regression

print(clf)

#Creating the model on Training Data

LOG=clf.fit(X_train,y_train)

prediction=LOG.predict(X_test)

#Measuring accuracy on Testing Data

from sklearn import metrics

print(metrics.classification_report(y_test, prediction))

print(metrics.confusion_matrix(y_test, prediction))

#Printing some sample values of prediction

TestingDataResults=pd.DataFrame(data=X_test, columns=Predictors)

TestingDataResults['TargetColumn']=y_test

TestingDataResults['Prediction']=prediction

TestingDataResults.head()

Q. Explain how the Decision Tree algorithm works?

Decision Trees are suitable for both Regression as well as Classification use cases.

Below Video explains how Decision Trees work for Regression use cases.

Decision Trees for Regression in Machine Learning

The below video explains how Decision Trees work for Classification.

Decision trees select the best predictor out of all available predictors by measuring its efficiency using either Entropy, Information gain or Gini index
Basic idea is to choose that predictor which helps to slice the data into two parts in such a way that each part contains similar values of the target variable
This activity creates a “root node”. Basically the first ‘if-statement’ to check to make decisions
keep repeating the same activity with both the slices of data until no further splits can be made.

Consider below example where Loan approval historical data is provided.
Target Variable: APPROVE_LOAN
Predictors: CIBIL, AGE, SALARY

Now, we need to find a “rule” or set of rules using this data which can help to make the decision about a new loan application.

Historical Loan Approval Data to understand decision trees — Historical Loan Approval Data

The Decision Tree algorithm will try to find the best splitter of data.

Basically, out of AGE, SALARY and CIBIL score of a loan applicant, which predictor “segregates” all the “Yes” cases together and all the “No” cases together?

Since this is a simple data, we can see this with basic data analysis that “All loans are approved if the CIBIL score is more than 550”

Hence, by learning from historical data, we can formulate a rule. IF CIBIL>550 THEN Approve a loan ELSE reject it.

Best predictor is CIBIL score for this data and the it creates the root node of decision tree — IF CIBIL>550 THEN approve a loan ELSE reject it.

If we pass this data to the Decision Tree algorithm, it finds out this rule by itself and a decision node (IF-ELSE statement) is created as shown below.

If the data is not simple, then the same procedure is repeated until no further splits can be made.

A Decision Tree with three possible outcomes — A Decision Tree with two decision nodes

Q. How to create a Decision Tree for Regression in R?

###### CTREE single Decision Tree REGRESSION #######
library(party)
Model_CTREE=ctree(formula= Weight ~ Hours+Calories,data=DataForMLTrain)

# Measuring Goodness of Fit using R2 value on TRAINING DATA
Orig=DataForMLTrain$Weight
Pred=predict(Model_CTREE, DataForMLTrain)

R2= 1 - (sum((Orig-Pred)^2)/sum((Orig-mean(Orig))^2))
print(paste('R2 Value is:',round(R2,4)))

# Predictions of model on Testing data
DataForMLTest$Prediction=predict(Model_CTREE, DataForMLTest)
head(DataForMLTest)

# Calculating the Absolute Percentage Error for each prediction in TESTING DATA
LM_APE= 100 *(abs(DataForMLTest$Prediction -DataForMLTest$Weight)/DataForMLTest$Weight)
print(paste('### Mean Accuracy of Decision Tree Model is: ', 100 - mean(LM_APE)))
print(paste('### Median Accuracy of Decision Tree Model is: ', 100 -median(LM_APE)))

###### CTREE single Decision Tree REGRESSION #######

library(party)

Model_CTREE=ctree(formula= Weight ~ Hours+Calories,data=DataForMLTrain)

# Measuring Goodness of Fit using R2 value on TRAINING DATA

Orig=DataForMLTrain$Weight

Pred=predict(Model_CTREE, DataForMLTrain)

R2= 1 - (sum((Orig-Pred)^2)/sum((Orig-mean(Orig))^2))

print(paste('R2 Value is:',round(R2,4)))

# Predictions of model on Testing data

DataForMLTest$Prediction=predict(Model_CTREE, DataForMLTest)

head(DataForMLTest)

# Calculating the Absolute Percentage Error for each prediction in TESTING DATA

LM_APE= 100 *(abs(DataForMLTest$Prediction -DataForMLTest$Weight)/DataForMLTest$Weight)

print(paste('### Mean Accuracy of Decision Tree Model is: ', 100 - mean(LM_APE)))

print(paste('### Median Accuracy of Decision Tree Model is: ', 100 -median(LM_APE)))

Q. How to create a Decision Tree for Classification in R?

###### CTREE single Decision Tree CLASSIFICATION in R #######
# Creating the model on TRAINING data
library(party)
Model_RPART=ctree(APPROVE_LOAN~CIBIL+AGE+SALARY, data=DataForMLTrain)

# Checking Accuracy of model on Testing data
DataForMLTest$Prediction=predict(Model_RPART, DataForMLTest)
head(DataForMLTest)

# Creating the Confusion Matrix to calculate overall accuracy, precision and recall on TESTING data
library(caret)
AccuracyResults=confusionMatrix(DataForMLTest$Prediction, DataForMLTest$APPROVE_LOAN, mode = "prec_recall")

# Since AccuracyResults is a list of multiple items, fetching useful components only
AccuracyResults[['table']]
AccuracyResults[['byClass']]

print(paste('### Overall Accuracy of Decision Tree Model is: ', round(100 * AccuracyResults[['overall']][1]) , '%'))

###### CTREE single Decision Tree CLASSIFICATION in R #######

# Creating the model on TRAINING data

library(party)

Model_RPART=ctree(APPROVE_LOAN~CIBIL+AGE+SALARY, data=DataForMLTrain)

# Checking Accuracy of model on Testing data

DataForMLTest$Prediction=predict(Model_RPART, DataForMLTest)

head(DataForMLTest)

# Creating the Confusion Matrix to calculate overall accuracy, precision and recall on TESTING data

library(caret)

AccuracyResults=confusionMatrix(DataForMLTest$Prediction, DataForMLTest$APPROVE_LOAN, mode = "prec_recall")

# Since AccuracyResults is a list of multiple items, fetching useful components only

AccuracyResults[['table']]

AccuracyResults[['byClass']]

print(paste('### Overall Accuracy of Decision Tree Model is: ', round(100 * AccuracyResults[['overall']][1]) , '%'))

Q. How to create a Decision Tree for Regression in Python?

###### Single Decision Tree Regression in Python #######
from sklearn import tree
#choose from different tunable hyper parameters
RegModel = tree.DecisionTreeRegressor(max_depth=3,criterion='mse')

#Printing all the parameters of Decision Tree
print(RegModel)

#Creating the model on Training Data
DTree=RegModel.fit(X_train,y_train)
prediction=DTree.predict(X_test)

#Measuring Goodness of fit in Training data
from sklearn import metrics
print('R2 Value:',metrics.r2_score(y_train, DTree.predict(X_train)))

#Measuring accuracy on Testing Data
print('Accuracy',100- (np.mean(np.abs((y_test - prediction) / y_test)) * 100))

#Plotting the feature importance for Top 10 most important columns
%matplotlib inline
feature_importances = pd.Series(DTree.feature_importances_, index=Predictors)
feature_importances.nlargest(10).plot(kind='barh')

#Printing some sample values of prediction
TestingDataResults=pd.DataFrame(data=X_test, columns=Predictors)
TestingDataResults[TargetVariable]=y_test
TestingDataResults[('Predicted'+TargetVariable)]=prediction
TestingDataResults.head()

###### Single Decision Tree Regression in Python #######

from sklearn import tree

#choose from different tunable hyper parameters

RegModel = tree.DecisionTreeRegressor(max_depth=3,criterion='mse')

#Printing all the parameters of Decision Tree

print(RegModel)

#Creating the model on Training Data

DTree=RegModel.fit(X_train,y_train)

prediction=DTree.predict(X_test)

#Measuring Goodness of fit in Training data

from sklearn import metrics

print('R2 Value:',metrics.r2_score(y_train, DTree.predict(X_train)))

#Measuring accuracy on Testing Data

print('Accuracy',100- (np.mean(np.abs((y_test - prediction) / y_test)) * 100))

#Plotting the feature importance for Top 10 most important columns

%matplotlib inline

feature_importances = pd.Series(DTree.feature_importances_, index=Predictors)

feature_importances.nlargest(10).plot(kind='barh')

#Printing some sample values of prediction

TestingDataResults=pd.DataFrame(data=X_test, columns=Predictors)

TestingDataResults[TargetVariable]=y_test

TestingDataResults[('Predicted'+TargetVariable)]=prediction

TestingDataResults.head()

Q. How to create a Decision Tree for Classification in Python?

###### Single Decision Tree CLASSIFICATION in Python #######
import pandas as pd
from sklearn import tree

#choose from different tunable hyper parameters
clf = tree.DecisionTreeClassifier(max_depth=3,criterion='entropy')

#Printing all the parameters of Decision Trees
print(clf)

#Creating the model on Training Data
DTree=clf.fit(X_train,y_train)
prediction=DTree.predict(X_test)

#Measuring accuracy on Testing Data
from sklearn import metrics
print(metrics.classification_report(y_test, prediction))
print(metrics.confusion_matrix(y_test, prediction))

#Plotting the feature importance for Top 10 most important columns
%matplotlib inline
feature_importances = pd.Series(DTree.feature_importances_, index=Predictors)
feature_importances.nlargest(10).plot(kind='barh')

#Printing some sample values of prediction
TestingDataResults=pd.DataFrame(data=X_test, columns=Predictors)
TestingDataResults['TargetColumn']=y_test
TestingDataResults['Prediction']=prediction
TestingDataResults.head()

###### Single Decision Tree CLASSIFICATION in Python #######

import pandas as pd

from sklearn import tree

#choose from different tunable hyper parameters

clf = tree.DecisionTreeClassifier(max_depth=3,criterion='entropy')

#Printing all the parameters of Decision Trees

print(clf)

#Creating the model on Training Data

DTree=clf.fit(X_train,y_train)

prediction=DTree.predict(X_test)

#Measuring accuracy on Testing Data

from sklearn import metrics

print(metrics.classification_report(y_test, prediction))

print(metrics.confusion_matrix(y_test, prediction))

#Plotting the feature importance for Top 10 most important columns

%matplotlib inline

feature_importances = pd.Series(DTree.feature_importances_, index=Predictors)

feature_importances.nlargest(10).plot(kind='barh')

#Printing some sample values of prediction

TestingDataResults=pd.DataFrame(data=X_test, columns=Predictors)

TestingDataResults['TargetColumn']=y_test

TestingDataResults['Prediction']=prediction

TestingDataResults.head()

Q. Explain how the Random Forest algorithm works?

Random Forests are suitable for both Regression as well as Classification use cases.

Random Forest for Regression in Machine Learning

Random Forest for Classification in Machine Learning

Random forests are basically multiple decision trees put together. This is also known as bagging

To create a Random Forest predictive model, the below steps are followed.

Take some random rows from the data, let’s say 80% of all the rows randomly selected.
Hence every time selecting some different set of rows.
Take some random columns from the above data, let’s say 50% of all the columns randomly selected.
Hence every time selecting some different set of columns.
Create a decision tree using the above data.
Repeat steps 1 to 3 for n number of times (Number of trees). n could be any number like 10 or 50 or 500 etc. (This is known as bagging)
Combine the predictions from each of the trees to get a final answer.
In the case of Regression, the final answer is the average of predictions made by all trees.
In the case of Classification, the final answer is the mode(majority vote) of predictions made by all trees.

These steps ensure that every possible predictor gets a fair chance in the process.
Because we limit the columns used for each tree.
Also, there is very less bias in the model because we select some different set of random rows for each tree.

This above procedure helps to get good accuracy out of the box!

Consider below dataset with 20 records. The target variable is APPROVE_LOAN and predictors are CIBIL, AGE, and SALARY.

The first step is to create smaller samples of the above data by choosing a few random rows and predictor columns while always keeping the target variable in the sample. The number of predictor columns being chosen can be controlled using the parameter “max_features” by default it is the square root of the total number of predictor columns.

Selecting Random Rows and Predictor Columns from the Training data.
APPROVE_LOAN is the target variable, hence it will always be present

The second step is to create independent decision trees on each of the smaller datasets created above and combine the predictions from all of them in the below example, the number of trees (parameter n_estimators) is equal to 3.

The final prediction is the average of all predictions in case of regression and the majority vote in case of classification.

Creating multiple independent Decision Trees which is known as Random Forest

Q. How to create a Random Forest for Regression in R?

###### Random Forest Regression in R #######
library(randomForest)
Model_RF=randomForest(formula=Weight ~ Hours+Calories,data=DataForMLTrain,ntree=10)
 
# Measuring Goodness of Fit using R2 value on TRAINING DATA
Orig=DataForMLTrain$Weight
Pred=predict(Model_RF, DataForMLTrain)

R2= 1 - (sum((Orig-Pred)^2)/sum((Orig-mean(Orig))^2))
print(paste('R2 Value is:',round(R2,4)))
 
# Predictions of model on Testing data
DataForMLTest$Prediction=predict(Model_RF, DataForMLTest)
head(DataForMLTest)
 
# Calculating the Absolute Percentage Error for each prediction in TESTING DATA
LM_APE= 100 *(abs(DataForMLTest$Prediction -DataForMLTest$Weight)/DataForMLTest$Weight)
print(paste('### Mean Accuracy of Random Forest Model is: ', 100 - mean(LM_APE)))
print(paste('### Median Accuracy of Random Forest Model is: ', 100 -median(LM_APE)))

###### Random Forest Regression in R #######

library(randomForest)

Model_RF=randomForest(formula=Weight ~ Hours+Calories,data=DataForMLTrain,ntree=10)

# Measuring Goodness of Fit using R2 value on TRAINING DATA

Orig=DataForMLTrain$Weight

Pred=predict(Model_RF, DataForMLTrain)

R2= 1 - (sum((Orig-Pred)^2)/sum((Orig-mean(Orig))^2))

print(paste('R2 Value is:',round(R2,4)))

# Predictions of model on Testing data

DataForMLTest$Prediction=predict(Model_RF, DataForMLTest)

head(DataForMLTest)

# Calculating the Absolute Percentage Error for each prediction in TESTING DATA

LM_APE= 100 *(abs(DataForMLTest$Prediction -DataForMLTest$Weight)/DataForMLTest$Weight)

print(paste('### Mean Accuracy of Random Forest Model is: ', 100 - mean(LM_APE)))

print(paste('### Median Accuracy of Random Forest Model is: ', 100 -median(LM_APE)))

Q. How to create a Random Forest for Classification in R?

###### Random Forest Classification in R #######
# Creating the model on TRAINING data
library(randomForest)
Model_RF=randomForest(APPROVE_LOAN~CIBIL+AGE+SALARY, data=DataForMLTrain, ntree=10)

# Checking Accuracy of model on Testing data
DataForMLTest$Prediction=predict(Model_RF, DataForMLTest)
head(DataForMLTest)

# Creating the Confusion Matrix to calculate overall accuracy, precision and recall on TESTING data

library(caret)
AccuracyResults=confusionMatrix(DataForMLTest$Prediction,DataForMLTest$APPROVE_LOAN, mode = "prec_recall")

# Since AccuracyResults is a list of multiple items, fetching useful components only
AccuracyResults[['table']]
AccuracyResults[['byClass']]

print(paste('### Overall Accuracy of Random Forest Model is: ', round(100 * AccuracyResults[['overall']][1]) , '%'))

###### Random Forest Classification in R #######

# Creating the model on TRAINING data

library(randomForest)

Model_RF=randomForest(APPROVE_LOAN~CIBIL+AGE+SALARY, data=DataForMLTrain, ntree=10)

# Checking Accuracy of model on Testing data

DataForMLTest$Prediction=predict(Model_RF, DataForMLTest)

head(DataForMLTest)

# Creating the Confusion Matrix to calculate overall accuracy, precision and recall on TESTING data

library(caret)

AccuracyResults=confusionMatrix(DataForMLTest$Prediction,DataForMLTest$APPROVE_LOAN, mode = "prec_recall")

# Since AccuracyResults is a list of multiple items, fetching useful components only

AccuracyResults[['table']]

AccuracyResults[['byClass']]

print(paste('### Overall Accuracy of Random Forest Model is: ', round(100 * AccuracyResults[['overall']][1]) , '%'))

Q. How to create a Random Forest for Regression in Python?

###### Random Forest Regression in Python #######
from sklearn.ensemble import RandomForestRegressor
RegModel = RandomForestRegressor(n_estimators=100,criterion='mse')

#Printing all the parameters of Random Forest
print(RegModel)

#Creating the model on Training Data
RF=RegModel.fit(X_train,y_train)
prediction=RF.predict(X_test)

#Measuring Goodness of fit in Training data
from sklearn import metrics
print('R2 Value:',metrics.r2_score(y_train, RF.predict(X_train)))

#Measuring accuracy on Testing Data
print('Accuracy',100- (np.mean(np.abs((y_test - prediction) / y_test)) * 100))

#Plotting the feature importance for Top 10 most important columns
%matplotlib inline
feature_importances = pd.Series(RF.feature_importances_, index=Predictors)
feature_importances.nlargest(10).plot(kind='barh')

#Printing some sample values of prediction
TestingDataResults=pd.DataFrame(data=X_test, columns=Predictors)
TestingDataResults[TargetVariable]=y_test
TestingDataResults[('Predicted'+TargetVariable)]=prediction
TestingDataResults.head()

###### Random Forest Regression in Python #######

from sklearn.ensemble import RandomForestRegressor

RegModel = RandomForestRegressor(n_estimators=100,criterion='mse')

#Printing all the parameters of Random Forest

print(RegModel)

#Creating the model on Training Data

RF=RegModel.fit(X_train,y_train)

prediction=RF.predict(X_test)

#Measuring Goodness of fit in Training data

from sklearn import metrics

print('R2 Value:',metrics.r2_score(y_train, RF.predict(X_train)))

#Measuring accuracy on Testing Data

print('Accuracy',100- (np.mean(np.abs((y_test - prediction) / y_test)) * 100))

#Plotting the feature importance for Top 10 most important columns

%matplotlib inline

feature_importances = pd.Series(RF.feature_importances_, index=Predictors)

feature_importances.nlargest(10).plot(kind='barh')

#Printing some sample values of prediction

TestingDataResults=pd.DataFrame(data=X_test, columns=Predictors)

TestingDataResults[TargetVariable]=y_test

TestingDataResults[('Predicted'+TargetVariable)]=prediction

TestingDataResults.head()

Q. How to create a Random Forest for Classification in Python?

###### Random Forest Classification in Python #######
#Random Forest (Bagging of multiple Decision Trees)
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100,criterion='gini')

#Printing all the parameters of Random Forest
print(clf)

#Creating the model on Training Data
RF=clf.fit(X_train,y_train)
prediction=RF.predict(X_test)

#Measuring accuracy on Testing Data
from sklearn import metrics
print(metrics.classification_report(y_test, prediction))
print(metrics.confusion_matrix(y_test, prediction))

#Plotting the feature importance for Top 10 most important columns
%matplotlib inline
feature_importances = pd.Series(RF.feature_importances_, index=Predictors)
feature_importances.nlargest(10).plot(kind='barh')

#Printing some sample values of prediction
TestingDataResults=pd.DataFrame(data=X_test, columns=Predictors)
TestingDataResults['TargetColumn']=y_test
TestingDataResults['Prediction']=prediction
TestingDataResults.head()

###### Random Forest Classification in Python #######

#Random Forest (Bagging of multiple Decision Trees)

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100,criterion='gini')

#Printing all the parameters of Random Forest

print(clf)

#Creating the model on Training Data

RF=clf.fit(X_train,y_train)

prediction=RF.predict(X_test)

#Measuring accuracy on Testing Data

from sklearn import metrics

print(metrics.classification_report(y_test, prediction))

print(metrics.confusion_matrix(y_test, prediction))

#Plotting the feature importance for Top 10 most important columns

%matplotlib inline

feature_importances = pd.Series(RF.feature_importances_, index=Predictors)

feature_importances.nlargest(10).plot(kind='barh')

#Printing some sample values of prediction

TestingDataResults=pd.DataFrame(data=X_test, columns=Predictors)

TestingDataResults['TargetColumn']=y_test

TestingDataResults['Prediction']=prediction

TestingDataResults.head()

Q. Explain how the Adaboost algorithm works?

Adaboost is the short form for adaptive boosting.
It is an algorithm which ensembles(combines) multiple simple predictive models also known as weak learners to generate a final strong model.

Adaboost for Regression in Machine Learning

Adaboost for Classification in Machine Learning

Adaboost algorithm ensembles(combines) multiple simple predictive models also known as weak learners to generate a final strong model.

A decision tree with one level, also known as decision stumps is the most popular weak learner algorithm used in AdaBoost

These are called stumps because these are so simple that when you plot them you just a line (stump!)

Decision Tree with one level is called a decision stump. — A Decision Stump(D1)

Look at the image, here D1 is a decision stump.

Anything which is on the left side of D1 will be classified as positive (+) and anything which is on the right side will be classified as negative (-)

You can observe that it is not very efficient, because there are some positive values on the right side as well.

This is why multiple stumps are combined to create a final model which does a better job of classification

Adaboost is suitable for binary classification problems, however, you can use it for multi-class classification or regression as well.

The basic idea behind Adaboost is Boosting. It works with below listed steps

Create a first predictive model on original data
Create a second predictive model which corrects the mistakes of the first model
Create a third predictive model which corrects the mistakes of the second model
Keep creating more models until 100% accuracy is achieved OR the max number of iterations(number of trees) is reached

Adaboost take the general concept of boosting and adds a twist to it by assigning weights to data and models.

Adaboost take this general concept of boosting and adds a twist to it by assigning weights to data and models. Below listed steps describe the working of Adaboost.

Take the original training data and assign a weight of 1/n to all rows. (n=total number of rows in Training data)
Randomly select a subset of rows from the original data and create a predictive model(decision stump)
Generate predictions on the above subset using the above predictive model
Update the weights in the original data. Give higher weights to those rows where the prediction was incorrect, and give lesser weights to those rows where the prediction was correct. Give the current model higher weight if it is accurate, otherwise, give it less weight
Select the rows from original data based on weights. Those with higher weights get selected first.
These rows signify the mistakes by the previous predictive model.
Create a predictive model on the above data.
Repeat steps 3-6 until 100% accuracy is achieved OR the max number of iterations (number of trees) is reached
The final prediction is the weighted average of all predictions made by different models created above.

Below flowchart visualizes the above-listed steps for Adaboost.

Q. How to create Adaboost for Regression in R?

There is no stable open-source version of Adaboost implementation for Regression in R language as of now.

Q. How to create Adaboost for Classification in R?

###### Adaboost Classification in R #######
# Creating the model on TRAINING data
library(caret)
Model_ADA=train(APPROVE_LOAN ~CIBIL+AGE+SALARY, data=DataForMLTrain, method='adaboost')

# Checking Accuracy of model on Testing data
DataForMLTest$Prediction=predict(Model_ADA, DataForMLTest)
head(DataForMLTest)

# Creating the Confusion Matrix to calculate overall accuracy, precision and recall on TESTING data
AccuracyResults=confusionMatrix(DataForMLTest$Prediction, DataForMLTest$APPROVE_LOAN, mode = "prec_recall")

# Since AccuracyResults is a list of multiple items, fetching useful components only
AccuracyResults[['table']]
AccuracyResults[['byClass']]

print(paste('### Overall Accuracy of ADABOOST Model is: ', round(100 * AccuracyResults[['overall']][1]) , '%'))

###### Adaboost Classification in R #######

# Creating the model on TRAINING data

library(caret)

Model_ADA=train(APPROVE_LOAN ~CIBIL+AGE+SALARY, data=DataForMLTrain, method='adaboost')

# Checking Accuracy of model on Testing data

DataForMLTest$Prediction=predict(Model_ADA, DataForMLTest)

head(DataForMLTest)

# Creating the Confusion Matrix to calculate overall accuracy, precision and recall on TESTING data

AccuracyResults=confusionMatrix(DataForMLTest$Prediction, DataForMLTest$APPROVE_LOAN, mode = "prec_recall")

# Since AccuracyResults is a list of multiple items, fetching useful components only

AccuracyResults[['table']]

AccuracyResults[['byClass']]

print(paste('### Overall Accuracy of ADABOOST Model is: ', round(100 * AccuracyResults[['overall']][1]) , '%'))

Q. How to create Adaboost for Regression in Python?

###### Adaboost Regression in Python #######
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

#Choosing Decision Tree with 1 level as the weak learner
DTR=DecisionTreeRegressor(max_depth=1)
RegModel = AdaBoostRegressor(n_estimators=50, base_estimator=DTR ,learning_rate=1)

#Printing all the parameters of Adaboost
print(RegModel)

#Creating the model on Training Data
AB=RegModel.fit(X_train,y_train)
prediction=AB.predict(X_test)

#Measuring Goodness of fit in Training data
from sklearn import metrics
print('R2 Value:',metrics.r2_score(y_train, AB.predict(X_train)))

#Measuring accuracy on Testing Data
print('Accuracy',100- (np.mean(np.abs((y_test - prediction) / y_test)) * 100))

#Plotting the feature importance for Top 10 most important columns
%matplotlib inline
feature_importances = pd.Series(AB.feature_importances_, index=Predictors)
feature_importances.nlargest(10).plot(kind='barh')

#Printing some sample values of prediction
TestingDataResults=pd.DataFrame(data=X_test, columns=Predictors)
TestingDataResults[TargetVariable]=y_test
TestingDataResults[('Predicted'+TargetVariable)]=prediction
TestingDataResults.head()

###### Adaboost Regression in Python #######

from sklearn.ensemble import AdaBoostRegressor

from sklearn.tree import DecisionTreeRegressor

#Choosing Decision Tree with 1 level as the weak learner

DTR=DecisionTreeRegressor(max_depth=1)

RegModel = AdaBoostRegressor(n_estimators=50, base_estimator=DTR ,learning_rate=1)

#Printing all the parameters of Adaboost

print(RegModel)

#Creating the model on Training Data

AB=RegModel.fit(X_train,y_train)

prediction=AB.predict(X_test)

#Measuring Goodness of fit in Training data

from sklearn import metrics

print('R2 Value:',metrics.r2_score(y_train, AB.predict(X_train)))

#Measuring accuracy on Testing Data

print('Accuracy',100- (np.mean(np.abs((y_test - prediction) / y_test)) * 100))

#Plotting the feature importance for Top 10 most important columns

%matplotlib inline

feature_importances = pd.Series(AB.feature_importances_, index=Predictors)

feature_importances.nlargest(10).plot(kind='barh')

#Printing some sample values of prediction

TestingDataResults=pd.DataFrame(data=X_test, columns=Predictors)

TestingDataResults[TargetVariable]=y_test

TestingDataResults[('Predicted'+TargetVariable)]=prediction

TestingDataResults.head()

Q. How to create Adaboost for Classification in Python?

###### Adaboost Classification in Python #######
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

#Choosing Decision Tree with 1 level as the weak learner
DTC=DecisionTreeClassifier(max_depth=1)
clf = AdaBoostClassifier(n_estimators=50, base_estimator=DTC ,learning_rate=1)

#Printing all the parameters of Adaboost
print(clf)

#Creating the model on Training Data
AB=clf.fit(X_train,y_train)
prediction=AB.predict(X_test)

#Measuring accuracy on Testing Data
from sklearn import metrics
print(metrics.classification_report(y_test, prediction))
print(metrics.confusion_matrix(y_test, prediction))

#Plotting the feature importance for Top 10 most important columns
%matplotlib inline
feature_importances = pd.Series(AB.feature_importances_, index=Predictors)
feature_importances.nlargest(10).plot(kind='barh')

#Printing some sample values of prediction
TestingDataResults=pd.DataFrame(data=X_test, columns=Predictors)
TestingDataResults['TargetColumn']=y_test
TestingDataResults['Prediction']=prediction
TestingDataResults.head()

###### Adaboost Classification in Python #######

import pandas as pd

from sklearn.ensemble import AdaBoostClassifier

from sklearn.tree import DecisionTreeClassifier

#Choosing Decision Tree with 1 level as the weak learner

DTC=DecisionTreeClassifier(max_depth=1)

clf = AdaBoostClassifier(n_estimators=50, base_estimator=DTC ,learning_rate=1)

#Printing all the parameters of Adaboost

print(clf)

#Creating the model on Training Data

AB=clf.fit(X_train,y_train)

prediction=AB.predict(X_test)

#Measuring accuracy on Testing Data

from sklearn import metrics

print(metrics.classification_report(y_test, prediction))

print(metrics.confusion_matrix(y_test, prediction))

#Plotting the feature importance for Top 10 most important columns

%matplotlib inline

feature_importances = pd.Series(AB.feature_importances_, index=Predictors)

feature_importances.nlargest(10).plot(kind='barh')

#Printing some sample values of prediction

TestingDataResults=pd.DataFrame(data=X_test, columns=Predictors)

TestingDataResults['TargetColumn']=y_test

TestingDataResults['Prediction']=prediction

TestingDataResults.head()

Q. Explain how the XGBoost algorithm works?

XGboost is the short form for Xtreme Gradient Boosting.

It is one of the most popular algorithms in machine learning universe. It is used by winners at most of the hackathons!
The reason for its popularity is listed below

Handles large datasets with ease
Trains very fast
High accuracy for most of the datasets

XGboost for Regression in Machine Learning

XGboost for Classification in Machine Learning

The basic logic behind XGboost is Boosting.

Let us quickly revise how boosting works in general.

Create a Predictive model M1.
Create another Predictive model M2 which corrects the mistakes of the previous model M1. Hence M2 is the boosted version of M1.
Create another Predictive model M3 which corrects the mistakes of the previous model M2.
Keep repeating the process until the model is 100% accurate OR the maximum number of allowed iteration(number of trees) is reached.

The unique thing about XGBoost is how exactly it corrects the mistakes of the previous model.

The key differentiation about XGB is how exactly it corrects the mistakes of the previous model. In previous algorithm Adaboost, I discussed how it gives higher weights to the incorrectly predicted rows, and the next model focuses to reduce that.

In XGboost, the approach is focussed on reducing the error gradient (Mistakes of the previous model).

XGboost works with below-listed steps.

How xgboost works — XGBoost algorithm working flow

Create a predictive model M1 on full Training data with the target variable “y”. (Notice that there is no random sampling here. XGBoost uses full training data.)
Find the difference between the values of the original target variable “y” and predicted target variable “y1” as “e1 = y – y1”
Create another predictive model M2 with the target variable as e1 which is “y – y1”. i.e. the gradient of the mistakes by previous model M1.
Find the difference between the values of the target variable “y-y1” and predicted target variable “y2” as “e2 = y – y1 -y2”
Create another predictive model M3 with the target variable as e2 which is “y – y1 -y2”. i.e. the gradient of the mistakes by previous model M2.
Keep repeating these steps until the error becomes zero OR the maximum number of iterations is reached.
The final model will exhibit the least amount of error and hence it will provide the maximum accuracy by learning from the mistakes of all previous models.
The final prediction will be the sum of the predictions from all models. y_final = y1 + y2 + y3 …

This seems misleading at once but if you observe closely the Biggest value is y1 and all other values are adjustments to the y1 since these are the predictions for the errors and not the original target variable “y”.

Hence, the sum of all makes it more accurate since every next model tries to minimize the errors made by the previous model.

This is why XGboost is also known as an additive model since you keep adding weak learner models on top of the first one and in the end, you get a strong predictive model.

Xgboost keeps adding weak learner models on top of the first model and in the end, you get a strong predictive model.

Q. How to create XGBoost for Regression in R?

###### Xgboost Regression in R #######
# Performing sampling again to create a fresh copy for XGBOOST
# Sampling | Splitting data into 70% for training 30% for testing
TrainingSampleIndex=sample(1:nrow(GymData), size=0.7 * nrow(GymData) )
DataForMLTrain=GymData[TrainingSampleIndex, ]
DataForMLTest=GymData[-TrainingSampleIndex, ]
dim(DataForMLTrain)
dim(DataForMLTest)

# XGBOOST requires data input in matrix format
# XGBOOST works only on Numeric data so its mandatory to convert every factor PREDICTOR to numeric

# Using Dummy variables
library(dummies)
DataForMLTrainDummy=dummy.data.frame(DataForMLTrain, sep='_', drop=F)
DataForMLTestDummy=dummy.data.frame(DataForMLTest, sep='_', drop=F)

# Making Sure the Datatype is numeric for each column
DataForMLTrainDummy=sapply(DataForMLTrainDummy,as.numeric)
DataForMLTestDummy=sapply(DataForMLTestDummy,as.numeric)

# Choosing all other colums except target variable
PredictorDataForTraining &lt;- data.matrix(DataForMLTrainDummy[, -3])
PredictorDataForTesting &lt;- data.matrix(DataForMLTestDummy[, -3])

# Choosing only the TargetVariable from data
TargetVariableForTraining&lt;- data.matrix(DataForMLTrainDummy[, 3])

# Training the XGboost model in Training data
Model_XGB=xgboost(data=PredictorDataForTraining, label=TargetVariableForTraining,&nbsp;
 eta = 0.02, max_depth = 100 ,nround=1000, subsample=0.95)

# Measuring Goodness of Fit using R2 value on TRAINING DATA
Orig=DataForMLTrain$Weight
Pred=predict(Model_XGB, PredictorDataForTraining)

R2= 1 - (sum((Orig-Pred)^2)/sum((Orig-mean(Orig))^2))
print(paste('R2 Value is:',round(R2,4)))

# Predictions of model on Testing data
DataForMLTest$Prediction=predict(Model_XGB, PredictorDataForTesting)
head(DataForMLTest)

# Calculating the Absolute Percentage Error for each prediction in TESTING DATA
LM_APE= 100 *(abs(DataForMLTest$Prediction - DataForMLTest$Weight)/DataForMLTest$Weight)
print(paste('### Mean Accuracy of XGBOOST Model is: ', 100 - mean(LM_APE)))
print(paste('### Median Accuracy of XGBOOST Model is: ', 100 - median(LM_APE)))

###### Xgboost Regression in R #######

# Performing sampling again to create a fresh copy for XGBOOST

# Sampling | Splitting data into 70% for training 30% for testing

TrainingSampleIndex=sample(1:nrow(GymData), size=0.7 * nrow(GymData) )

DataForMLTrain=GymData[TrainingSampleIndex, ]

DataForMLTest=GymData[-TrainingSampleIndex, ]

dim(DataForMLTrain)

dim(DataForMLTest)

# XGBOOST requires data input in matrix format

# XGBOOST works only on Numeric data so its mandatory to convert every factor PREDICTOR to numeric

# Using Dummy variables

library(dummies)

DataForMLTrainDummy=dummy.data.frame(DataForMLTrain, sep='_', drop=F)

DataForMLTestDummy=dummy.data.frame(DataForMLTest, sep='_', drop=F)

# Making Sure the Datatype is numeric for each column

DataForMLTrainDummy=sapply(DataForMLTrainDummy,as.numeric)

DataForMLTestDummy=sapply(DataForMLTestDummy,as.numeric)

# Choosing all other colums except target variable

PredictorDataForTraining <- data.matrix(DataForMLTrainDummy[, -3])

PredictorDataForTesting <- data.matrix(DataForMLTestDummy[, -3])

# Choosing only the TargetVariable from data

TargetVariableForTraining<- data.matrix(DataForMLTrainDummy[, 3])

# Training the XGboost model in Training data

Model_XGB=xgboost(data=PredictorDataForTraining, label=TargetVariableForTraining, 

eta = 0.02, max_depth = 100 ,nround=1000, subsample=0.95)

# Measuring Goodness of Fit using R2 value on TRAINING DATA

Orig=DataForMLTrain$Weight

Pred=predict(Model_XGB, PredictorDataForTraining)

R2= 1 - (sum((Orig-Pred)^2)/sum((Orig-mean(Orig))^2))

print(paste('R2 Value is:',round(R2,4)))

# Predictions of model on Testing data

DataForMLTest$Prediction=predict(Model_XGB, PredictorDataForTesting)

head(DataForMLTest)

# Calculating the Absolute Percentage Error for each prediction in TESTING DATA

LM_APE= 100 *(abs(DataForMLTest$Prediction - DataForMLTest$Weight)/DataForMLTest$Weight)

print(paste('### Mean Accuracy of XGBOOST Model is: ', 100 - mean(LM_APE)))

print(paste('### Median Accuracy of XGBOOST Model is: ', 100 - median(LM_APE)))

Q. How to create XGBoost for Classification in R?

###### Xgboost Classification in R #######
library(xgboost)
# Performing sampling again to create a fresh copy for XGBOOST
# Sampling | Splitting data into 70% for training 30% for testing
TrainingSampleIndex=sample(1:nrow(LoanData), size=0.7 * nrow(LoanData) )
DataForMLTrain=LoanData[TrainingSampleIndex, ]
DataForMLTest=LoanData[-TrainingSampleIndex, ]
dim(DataForMLTrain)
dim(DataForMLTest)

# XGBOOST requires data input in matrix format
# XGBOOST works only on Numeric data so its mandatory to convert every factor PREDICTOR to numeric

# Using Dummy variables
library(dummies)
DataForMLTrainDummy=dummy.data.frame(DataForMLTrain[,-4], sep='_', drop=F)
DataForMLTestDummy=dummy.data.frame(DataForMLTest[, -4], sep='_', drop=F)

PredictorDataForTraining &lt;- data.matrix(DataForMLTrainDummy[, -4])
PredictorDataForTesting &lt;- data.matrix(DataForMLTestDummy[, -4])
head(PredictorDataForTraining)
 
# Choosing only the TargetVariable from data
TargetVariableForTraining&lt;- DataForMLTrain[, 4]
numberOfClasses=length(unique(TargetVariableForTraining))
head(TargetVariableForTraining)
table(TargetVariableForTraining)
 
# The XGBoost algorithm requires that the class values must start at 0 and increase sequentially to the maximum number of&nbsp;
# classes. This is a bit of an inconvenience as you need to keep track of what name goes with which number.&nbsp;
# Also, you need to be very careful when you add or remove a 1 to go from the zero based labels to the 1 based labels.
 
# Converting all factor values to numeric values
TargetVariableForTraining=as.numeric(TargetVariableForTraining)
table(TargetVariableForTraining)

# Subtracting 1 to start class values from 0
# Make sure you note down what number represents which factor values
TargetVariableForTraining=TargetVariableForTraining-1
table(TargetVariableForTraining)

# Converting to Data Matrix format
TargetVariableForTraining=data.matrix(TargetVariableForTraining)

xgb_params &lt;- list("objective" = "multi:softmax","eval_metric" = "mlogloss","num_class" = numberOfClasses)

# Creating the model on TRAINING data
Model_XGB=xgboost(data=PredictorDataForTraining, label=TargetVariableForTraining, nrounds=10, params=xgb_params)

# Checking Accuracy of model on Testing data
DataForMLTest$Prediction=predict(Model_XGB, PredictorDataForTesting)
head(DataForMLTest)

# checking the training data to see the representation of each class
NumberRepresentation=as.numeric(names(table(TargetVariableForTraining)))
CharacterRepresentation=names(table(DataForMLTrain[, 4]))
ValuesMapping=data.frame(NumberRepresentation,CharacterRepresentation)
print(ValuesMapping)

# Replacing all numbers with their character classes
 for(i in 1:nrow(ValuesMapping)){
 DataForMLTest$Prediction=gsub(ValuesMapping[i,1],ValuesMapping[i,2],DataForMLTest$Prediction)
 }
DataForMLTest$Prediction=as.factor(DataForMLTest$Prediction)
head(DataForMLTest)

# Creating the Confusion Matrix to calculate overall accuracy, precision and recall on TESTING data
library(caret)
AccuracyResults=confusionMatrix(DataForMLTest$Prediction, DataForMLTest$APPROVE_LOAN, mode = "prec_recall")
 
# Since AccuracyResults is a list of multiple items, fetching useful components only
AccuracyResults[['table']]
AccuracyResults[['byClass']]

print(paste('### Overall Accuracy of XGBOOST Model is: ', round(100 * AccuracyResults[['overall']][1]) , '%'))

###### Xgboost Classification in R #######

library(xgboost)

# Performing sampling again to create a fresh copy for XGBOOST

# Sampling | Splitting data into 70% for training 30% for testing

TrainingSampleIndex=sample(1:nrow(LoanData), size=0.7 * nrow(LoanData) )

DataForMLTrain=LoanData[TrainingSampleIndex, ]

DataForMLTest=LoanData[-TrainingSampleIndex, ]

dim(DataForMLTrain)

dim(DataForMLTest)

# XGBOOST requires data input in matrix format

# XGBOOST works only on Numeric data so its mandatory to convert every factor PREDICTOR to numeric

# Using Dummy variables

library(dummies)

DataForMLTrainDummy=dummy.data.frame(DataForMLTrain[,-4], sep='_', drop=F)

DataForMLTestDummy=dummy.data.frame(DataForMLTest[, -4], sep='_', drop=F)

PredictorDataForTraining <- data.matrix(DataForMLTrainDummy[, -4])

PredictorDataForTesting <- data.matrix(DataForMLTestDummy[, -4])

head(PredictorDataForTraining)

# Choosing only the TargetVariable from data

TargetVariableForTraining<- DataForMLTrain[, 4]

numberOfClasses=length(unique(TargetVariableForTraining))

head(TargetVariableForTraining)

table(TargetVariableForTraining)

# The XGBoost algorithm requires that the class values must start at 0 and increase sequentially to the maximum number of 

# classes. This is a bit of an inconvenience as you need to keep track of what name goes with which number. 

# Also, you need to be very careful when you add or remove a 1 to go from the zero based labels to the 1 based labels.

# Converting all factor values to numeric values

TargetVariableForTraining=as.numeric(TargetVariableForTraining)

table(TargetVariableForTraining)

# Subtracting 1 to start class values from 0

# Make sure you note down what number represents which factor values

TargetVariableForTraining=TargetVariableForTraining-1

table(TargetVariableForTraining)

# Converting to Data Matrix format

TargetVariableForTraining=data.matrix(TargetVariableForTraining)

xgb_params <- list("objective" = "multi:softmax","eval_metric" = "mlogloss","num_class" = numberOfClasses)

# Creating the model on TRAINING data

Model_XGB=xgboost(data=PredictorDataForTraining, label=TargetVariableForTraining, nrounds=10, params=xgb_params)

# Checking Accuracy of model on Testing data

DataForMLTest$Prediction=predict(Model_XGB, PredictorDataForTesting)

head(DataForMLTest)

# checking the training data to see the representation of each class

NumberRepresentation=as.numeric(names(table(TargetVariableForTraining)))

CharacterRepresentation=names(table(DataForMLTrain[, 4]))

ValuesMapping=data.frame(NumberRepresentation,CharacterRepresentation)

print(ValuesMapping)

# Replacing all numbers with their character classes

for(i in 1:nrow(ValuesMapping)){

DataForMLTest$Prediction=gsub(ValuesMapping[i,1],ValuesMapping[i,2],DataForMLTest$Prediction)

}

DataForMLTest$Prediction=as.factor(DataForMLTest$Prediction)

head(DataForMLTest)

# Creating the Confusion Matrix to calculate overall accuracy, precision and recall on TESTING data

library(caret)

AccuracyResults=confusionMatrix(DataForMLTest$Prediction, DataForMLTest$APPROVE_LOAN, mode = "prec_recall")

# Since AccuracyResults is a list of multiple items, fetching useful components only

AccuracyResults[['table']]

AccuracyResults[['byClass']]

print(paste('### Overall Accuracy of XGBOOST Model is: ', round(100 * AccuracyResults[['overall']][1]) , '%'))

Q. How to create XGBoost for Regression in Python?

###### Xgboost Regression in Python #######
from xgboost import XGBRegressor
RegModel=XGBRegressor(max_depth=3, learning_rate=0.1, n_estimators=500, objective='reg:linear', booster='gbtree')

#Printing all the parameters of XGBoost
print(RegModel)

#Creating the model on Training Data
XGB=RegModel.fit(X_train,y_train)
prediction=XGB.predict(X_test)

#Measuring Goodness of fit in Training data
from sklearn import metrics
print('R2 Value:',metrics.r2_score(y_train, XGB.predict(X_train)))

#Measuring accuracy on Testing Data
print('Accuracy',100- (np.mean(np.abs((y_test - prediction) / y_test)) * 100))

#Plotting the feature importance for Top 10 most important columns
%matplotlib inline
feature_importances = pd.Series(XGB.feature_importances_, index=Predictors)
feature_importances.nlargest(10).plot(kind='barh')

#Printing some sample values of prediction
TestingDataResults=pd.DataFrame(data=X_test, columns=Predictors)
TestingDataResults[TargetVariable]=y_test
TestingDataResults[('Predicted'+TargetVariable)]=prediction
TestingDataResults.head()

###### Xgboost Regression in Python #######

from xgboost import XGBRegressor

RegModel=XGBRegressor(max_depth=3, learning_rate=0.1, n_estimators=500, objective='reg:linear', booster='gbtree')

#Printing all the parameters of XGBoost

print(RegModel)

#Creating the model on Training Data

XGB=RegModel.fit(X_train,y_train)

prediction=XGB.predict(X_test)

#Measuring Goodness of fit in Training data

from sklearn import metrics

print('R2 Value:',metrics.r2_score(y_train, XGB.predict(X_train)))

#Measuring accuracy on Testing Data

print('Accuracy',100- (np.mean(np.abs((y_test - prediction) / y_test)) * 100))

#Plotting the feature importance for Top 10 most important columns

%matplotlib inline

feature_importances = pd.Series(XGB.feature_importances_, index=Predictors)

feature_importances.nlargest(10).plot(kind='barh')

#Printing some sample values of prediction

TestingDataResults=pd.DataFrame(data=X_test, columns=Predictors)

TestingDataResults[TargetVariable]=y_test

TestingDataResults[('Predicted'+TargetVariable)]=prediction

TestingDataResults.head()

Q. How to create XGBoost for Classification in Python?

###### Xgboost Classification in Python #######
import pandas as pd
from xgboost import XGBClassifier
clf=XGBClassifier(max_depth=3, learning_rate=0.1, n_estimators=500, objective='binary:logistic', booster='gbtree')

#Printing all the parameters of XGBoost
print(clf)

#Creating the model on Training Data
XGB=clf.fit(X_train,y_train)
prediction=XGB.predict(X_test)

#Measuring accuracy on Testing Data
from sklearn import metrics
print(metrics.classification_report(y_test, prediction))
print(metrics.confusion_matrix(y_test, prediction))

#Plotting the feature importance for Top 10 most important columns
%matplotlib inline
feature_importances = pd.Series(XGB.feature_importances_, index=Predictors)
feature_importances.nlargest(10).plot(kind='barh')

#Printing some sample values of prediction
TestingDataResults=pd.DataFrame(data=X_test, columns=Predictors)
TestingDataResults['TargetColumn']=y_test
TestingDataResults['Prediction']=prediction
TestingDataResults.head()

###### Xgboost Classification in Python #######

import pandas as pd

from xgboost import XGBClassifier

clf=XGBClassifier(max_depth=3, learning_rate=0.1, n_estimators=500, objective='binary:logistic', booster='gbtree')

#Printing all the parameters of XGBoost

print(clf)

#Creating the model on Training Data

XGB=clf.fit(X_train,y_train)

prediction=XGB.predict(X_test)

#Measuring accuracy on Testing Data

from sklearn import metrics

print(metrics.classification_report(y_test, prediction))

print(metrics.confusion_matrix(y_test, prediction))

#Plotting the feature importance for Top 10 most important columns

%matplotlib inline

feature_importances = pd.Series(XGB.feature_importances_, index=Predictors)

feature_importances.nlargest(10).plot(kind='barh')

#Printing some sample values of prediction

TestingDataResults=pd.DataFrame(data=X_test, columns=Predictors)

TestingDataResults['TargetColumn']=y_test

TestingDataResults['Prediction']=prediction

TestingDataResults.head()

Q Explain how the KNN algorithm works?

KNN stands for K Nearest Neighbours. As the name suggests this algorithm tries to classify a new case based on K nearby points.

A Practical range for K is 2 to 10. If K=3, it means KNN will try to look for 3 nearest points.
It consists of 3 simple steps for any new point

Find the most K number of similar(closest) points.
Find the count of each class in those K points.
Classify the new point as that class which is present the maximum number of times in these K points.
In the case of regression, take the average of nearest “K” points.

KNN algorithm in machine learning works by looking at nearest "K" Neighbours — KNN algorithm in machine learning

As you can see in the figure. The cross is the new point and we choose K=3.
KNN will look at the nearby 3 points. It can be seen that 2 points are blue out of 3. Hence the new point is assigned to the blue class.

The next question is, how does KNN find the nearest neighbour?
It does so by measuring distances between the points.

Distance between two points can be calculated using any one of the below methods.

Euclidean Distance: Take the difference between the coordinates of points and add it after squaring.
E.g. Two points A(2,3) and B(6,5) will have a Euclidean distance of (2-6)² + (3-5)² = 20

Manhattan Distance: The sum of absolute differences between the coordinates of points.
E.g. Two points A(2,3) and B(6,5) will have a Manhattan distance of |2-6| + |3-5| = 6

Minkowski Distance: It is the generalization of both Euclidean and Manhattan Distance. A parameter “q” drives the behaviour of the below formula.
If q = 1, Manhattan Distance
If q = 2, Euclidean Distance

Q. How to create a KNN model for Regression in R?

###### K-Nearest Neighbour(KNN) Regression in R #######
library(FNN)
library(dummies)
DataForMLTrainDummy=dummy.data.frame(DataForMLTrain, sep='_', drop=F)
DataForMLTestDummy=dummy.data.frame(DataForMLTest, sep='_', drop=F)
 
# Making Sure the Datatype is numeric for each column
DataForMLTrainDummy=sapply(DataForMLTrainDummy,as.numeric)
DataForMLTestDummy=sapply(DataForMLTestDummy,as.numeric)
 
# Choosing all other colums except target variable
PredictorDataForTraining &lt;- data.matrix(DataForMLTrainDummy[, -3])
PredictorDataForTesting &lt;- data.matrix(DataForMLTestDummy[, -3])

# Choosing only the TargetVariable from data
TargetVariableForTraining&lt;- data.matrix(DataForMLTrainDummy[, 3])

# Fitting the predictive model on training data
Model_KNN=knn.reg(train=PredictorDataForTraining, y=TargetVariableForTraining, k=3)

# Measuring Goodness of Fit using R2 value on TRAINING DATA
Orig=DataForMLTrain$Weight
Pred=Model_KNN$pred

R2= 1 - (sum((Orig-Pred)^2)/sum((Orig-mean(Orig))^2))
print(paste('R2 Value is:',round(R2,4)))

# Predictions of model on Testing data
Model_KNN=knn.reg(train=PredictorDataForTraining, y=TargetVariableForTraining,&nbsp; test=PredictorDataForTesting, k=3)

DataForMLTest$Prediction=Model_KNN$pred
head(DataForMLTest)

# Calculating the Absolute Percentage Error for each prediction in TESTING DATA
LM_APE= 100 *(abs(DataForMLTest$Prediction -DataForMLTest$Weight)/DataForMLTest$Weight)
print(paste('### Mean Accuracy of KNN Model is: ', 100 - mean(LM_APE)))
print(paste('### Median Accuracy of KNN Model is: ', 100 - median(LM_APE)))

###### K-Nearest Neighbour(KNN) Regression in R #######

library(FNN)

library(dummies)

DataForMLTrainDummy=dummy.data.frame(DataForMLTrain, sep='_', drop=F)

DataForMLTestDummy=dummy.data.frame(DataForMLTest, sep='_', drop=F)

# Making Sure the Datatype is numeric for each column

DataForMLTrainDummy=sapply(DataForMLTrainDummy,as.numeric)

DataForMLTestDummy=sapply(DataForMLTestDummy,as.numeric)

# Choosing all other colums except target variable

PredictorDataForTraining <- data.matrix(DataForMLTrainDummy[, -3])

PredictorDataForTesting <- data.matrix(DataForMLTestDummy[, -3])

# Choosing only the TargetVariable from data

TargetVariableForTraining<- data.matrix(DataForMLTrainDummy[, 3])

# Fitting the predictive model on training data

Model_KNN=knn.reg(train=PredictorDataForTraining, y=TargetVariableForTraining, k=3)

# Measuring Goodness of Fit using R2 value on TRAINING DATA

Orig=DataForMLTrain$Weight

Pred=Model_KNN$pred

R2= 1 - (sum((Orig-Pred)^2)/sum((Orig-mean(Orig))^2))

print(paste('R2 Value is:',round(R2,4)))

# Predictions of model on Testing data

Model_KNN=knn.reg(train=PredictorDataForTraining, y=TargetVariableForTraining,  test=PredictorDataForTesting, k=3)

DataForMLTest$Prediction=Model_KNN$pred

head(DataForMLTest)

# Calculating the Absolute Percentage Error for each prediction in TESTING DATA

LM_APE= 100 *(abs(DataForMLTest$Prediction -DataForMLTest$Weight)/DataForMLTest$Weight)

print(paste('### Mean Accuracy of KNN Model is: ', 100 - mean(LM_APE)))

print(paste('### Median Accuracy of KNN Model is: ', 100 - median(LM_APE)))

Q. How to create a KNN model for Classification in R?

###### K-Nearest Neighbour(KNN) Classification in R #######
# Preparing Data for KNN
X_train=DataForMLTrain[,c('CIBIL','AGE','SALARY')]
X_test=DataForMLTest[,c('CIBIL','AGE','SALARY')]
y_train=DataForMLTrain[,c('APPROVE_LOAN')]
y_test=DataForMLTest[,c('APPROVE_LOAN')]
 
# Creating the model on TRAINING data
library(class)
Model_KNN=knn(train=X_train, test=X_test, cl=y_train , k=3)

# Checking Accuracy of model on Testing data
DataForMLTest$Prediction=Model_KNN
head(DataForMLTest)

# Creating the Confusion Matrix to calculate overall accuracy, precision and recall on TESTING data
library(caret)

AccuracyResults=confusionMatrix(DataForMLTest$Prediction, DataForMLTest$APPROVE_LOAN, mode = "prec_recall")

# Since AccuracyResults is a list of multiple items, fetching useful components only
AccuracyResults[['table']]
AccuracyResults[['byClass']]

print(paste('### Overall Accuracy of KNN Model is: ', round(100 * AccuracyResults[['overall']][1]) , '%'))

###### K-Nearest Neighbour(KNN) Classification in R #######

# Preparing Data for KNN

X_train=DataForMLTrain[,c('CIBIL','AGE','SALARY')]

X_test=DataForMLTest[,c('CIBIL','AGE','SALARY')]

y_train=DataForMLTrain[,c('APPROVE_LOAN')]

y_test=DataForMLTest[,c('APPROVE_LOAN')]

# Creating the model on TRAINING data

library(class)

Model_KNN=knn(train=X_train, test=X_test, cl=y_train , k=3)

# Checking Accuracy of model on Testing data

DataForMLTest$Prediction=Model_KNN

head(DataForMLTest)

# Creating the Confusion Matrix to calculate overall accuracy, precision and recall on TESTING data

library(caret)

AccuracyResults=confusionMatrix(DataForMLTest$Prediction, DataForMLTest$APPROVE_LOAN, mode = "prec_recall")

# Since AccuracyResults is a list of multiple items, fetching useful components only

AccuracyResults[['table']]

AccuracyResults[['byClass']]

print(paste('### Overall Accuracy of KNN Model is: ', round(100 * AccuracyResults[['overall']][1]) , '%'))

Q. How to create a KNN model for Regression in Python?

###### K-Nearest Neighbour(KNN) Regression in Python #######
from sklearn.neighbors import KNeighborsRegressor
RegModel = KNeighborsRegressor(n_neighbors=2)

#Printing all the parameters of KNN
print(RegModel)

#Creating the model on Training Data
KNN=RegModel.fit(X_train,y_train)
prediction=KNN.predict(X_test)

#Measuring Goodness of fit in Training data
from sklearn import metrics
print('R2 Value:',metrics.r2_score(y_train, KNN.predict(X_train)))

#Measuring accuracy on Testing Data
print('Accuracy',100- (np.mean(np.abs((y_test - prediction) / y_test)) * 100))

#Plotting the feature importance for Top 10 most important columns
#There is no built-in method to get feature importance in KNN

#Printing some sample values of prediction
TestingDataResults=pd.DataFrame(data=X_test, columns=Predictors)
TestingDataResults[TargetVariable]=y_test
TestingDataResults[('Predicted'+TargetVariable)]=prediction
TestingDataResults.head()

###### K-Nearest Neighbour(KNN) Regression in Python #######

from sklearn.neighbors import KNeighborsRegressor

RegModel = KNeighborsRegressor(n_neighbors=2)

#Printing all the parameters of KNN

print(RegModel)

#Creating the model on Training Data

KNN=RegModel.fit(X_train,y_train)

prediction=KNN.predict(X_test)

#Measuring Goodness of fit in Training data

from sklearn import metrics

print('R2 Value:',metrics.r2_score(y_train, KNN.predict(X_train)))

#Measuring accuracy on Testing Data

print('Accuracy',100- (np.mean(np.abs((y_test - prediction) / y_test)) * 100))

#Plotting the feature importance for Top 10 most important columns

#There is no built-in method to get feature importance in KNN

#Printing some sample values of prediction

TestingDataResults=pd.DataFrame(data=X_test, columns=Predictors)

TestingDataResults[TargetVariable]=y_test

TestingDataResults[('Predicted'+TargetVariable)]=prediction

TestingDataResults.head()

Q. How to create a KNN model for Classification in Python?

###### K-Nearest Neighbour(KNN) Classification in Python #######
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=2)

#Printing all the parameters of KNN
print(clf)

#Creating the model on Training Data
KNN=clf.fit(X_train,y_train)
prediction=KNN.predict(X_test)

#Measuring accuracy on Testing Data
from sklearn import metrics
print(metrics.classification_report(y_test, prediction))
print(metrics.confusion_matrix(y_test, prediction))

#Plotting the feature importance for Top 10 most important columns
#There is no built-in method to get feature importance in KNN

#Printing some sample values of prediction
TestingDataResults=pd.DataFrame(data=X_test, columns=Predictors)
TestingDataResults['TargetColumn']=y_test
TestingDataResults['Prediction']=prediction
TestingDataResults.head()

###### K-Nearest Neighbour(KNN) Classification in Python #######

import pandas as pd

from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=2)

#Printing all the parameters of KNN

print(clf)

#Creating the model on Training Data

KNN=clf.fit(X_train,y_train)

prediction=KNN.predict(X_test)

#Measuring accuracy on Testing Data

from sklearn import metrics

print(metrics.classification_report(y_test, prediction))

print(metrics.confusion_matrix(y_test, prediction))

#Plotting the feature importance for Top 10 most important columns

#There is no built-in method to get feature importance in KNN

#Printing some sample values of prediction

TestingDataResults=pd.DataFrame(data=X_test, columns=Predictors)

TestingDataResults['TargetColumn']=y_test

TestingDataResults['Prediction']=prediction

TestingDataResults.head()

Q Explain how the SVM algorithm works?

The support vector machine (SVM) is a generalization of a simple classifier called the maximal margin classifier.

SVM is preferred for classification, however, you can use it for regression as well.
Before you jump into SVM, the building blocks of SVM needs to be cleared which are listed below:

Hyperplane
Maximum Margin Classifier
Support Vector Classifier
Support Vector Machines

Maximum Margin Classifier can be used only when the classes are separated by a linear boundary, for example, look below. Out of the three lines possible to separate the red and blue regions, only one is the best line.

Only one line is the best separator of blue and red dots.

The lines are known as a Hyperplanes. Only one of the hyperplanes is the best one.

Concept of a Separating Hyperplane:

If there are “N” number of dimensions then, a hyperplane defined inside it has “N-1” dimensions.

If there are “N” number of dimensions then, a hyperplane defined inside it has “N-1” dimensions.

For example, if we have 2-Dimensions then the hyperplane will be a straight line drawn in it.
Look at the diagram below, The equation of the line drawn is 1 + 2X1 + 3X2 = 0. The two dimensions are X1(X-Axis) and X2(Y-Axis).

Now, if you choose some values of X1 and X2 such that 1+2X1 +3X2 > 0, then you will get the Blue region. These points are “Above” the line.

Again, if you choose some values of X1 and X2 such that 1+2X1 +3X2 < 0, then you will get the Red region. These points are “Below” the line.

You can say this line(1 + 2X1 + 3X2 = 0) is “Separating” the Red Points from the Blue Points.
Hence, this line is also known as a “Separating Hyperplane”

This Separating hyperplane also acts as a classifier. Let’s say a new point (X1n, X2n) needs to be assigned a class (Red or Blue)

If 1+2X1n +3X2n > 0 then it belongs to Blue region/Blue Class, and if If 1+2X1n +3X2n < 0 then it belongs to the Red Region/Red Class

For example, consider the point (X1=1, X2=1.2)
Substituting the values in the equation 1 + 2X1 + 3X2 we get 1+ 2*1 + 3 *1.2 = 6.6.
Now, because 6.6 >0 The point (X1=1, X2=1.2) belongs to Blue class / Blue Region

Maximum Margin classifier

Find a Separating Hyperplane in an N-dimensional space. While maximizing the perpendicular distances of the hyperplane from all the points present in training data.

That distance which is the smallest of all the perpendicular distances is known as “margin”.

A Maximum margin classifier is created when which has the largest margin.
Basically, it has the largest distance from the closest points of both all the classes.

Look at the diagram below with 2-dimensional space. With Red and Blue points. The maximum margin classifier is the one
which has maximum margin, i.e. the distance from the closest points of both classes.

Maximum Margin Classifier is the thick black line and the points on the dotted line are support vectors

Support Vector Classifier

Sometimes there is no “Separating Hyperplane” which clearly separates the data.
This is because the data points are jumbled up and there is no clear boundary between the classes.

Take a look below where the red and blue points are mixed up.
In this case, a generalized version of Maximum margin classifier is used, which is also known as support vector classifier.

Red and Blue classes are not clearly separable

The maximum margin classifier tries to classify each and every point perfectly and while doing so it can overfit and may not remain generic.
This is where the concept of “Soft margin” is applied.

“Soft margin” means the classifier is allowing a few wrong classifications in order to make the model generic.

The idea behind support vector classifier is to use a “soft margin”.

Support Vector Classifier allows some errors while Maximum Margin Classifier tries to perfectly classify the points

The amount of softness is controlled by a non-negative tuning parameter called “C”. If the value of C is high that means the hyperplane can have more misclassifications. If the value of C is less, then only a few misclassifications are allowed.

If the value of C=0 then it does not allow any misclassifications and hence becomes the maximum margin classifier.

“Soft margin” means the classifier is allowing a few wrong classifications in order to make the model generic.

Support Vector Machines

So far the classes were separated by a linear boundary. What if the separating boundary is non-linear?

In such cases, we should try to find an equation which is non-linear, to separate the classes.

This equation is found using “kernels” and the resulting classifier is known as support vector machines. The above diagram shows an SVM with a radial kernel, there could be linear kernels as well.

Hence, you can say that support vector machines are the extension of support vector classifier where the original data is transformed to a higher degree. e.g. linear to quadratic or cubic. so that It becomes linearly separable in that higher dimension.

Support vector machines are the extension of support vector classifier where the original data is transformed to a higher degree.

Q. How to create an SVM model for Regression in R?

###### Support Vector Machines(SVM) Regression in R #######
library(e1071)
Model_SVM=svm(formula= Weight ~ Hours+Calories,data=DataForMLTrain)

# Measuring Goodness of Fit using R2 value on TRAINING DATA
Orig=DataForMLTrain$Weight
Pred=predict(Model_SVM, DataForMLTrain)

R2= 1 - (sum((Orig-Pred)^2)/sum((Orig-mean(Orig))^2))
print(paste('R2 Value is:',round(R2,4)))

# Predictions of model on Testing data
DataForMLTest$Prediction=predict(Model_SVM, DataForMLTest)
head(DataForMLTest)

# Calculating the Absolute Percentage Error for each prediction in TESTING DATA
LM_APE= 100 *(abs(DataForMLTest$Prediction - DataForMLTest$Weight)/DataForMLTest$Weight)
print(paste('### Mean Accuracy of SVM Model is: ', 100 - mean(LM_APE)))
print(paste('### Median Accuracy of SVM Model is: ', 100 - median(LM_APE)))

###### Support Vector Machines(SVM) Regression in R #######

library(e1071)

Model_SVM=svm(formula= Weight ~ Hours+Calories,data=DataForMLTrain)

# Measuring Goodness of Fit using R2 value on TRAINING DATA

Orig=DataForMLTrain$Weight

Pred=predict(Model_SVM, DataForMLTrain)

R2= 1 - (sum((Orig-Pred)^2)/sum((Orig-mean(Orig))^2))

print(paste('R2 Value is:',round(R2,4)))

# Predictions of model on Testing data

DataForMLTest$Prediction=predict(Model_SVM, DataForMLTest)

head(DataForMLTest)

# Calculating the Absolute Percentage Error for each prediction in TESTING DATA

LM_APE= 100 *(abs(DataForMLTest$Prediction - DataForMLTest$Weight)/DataForMLTest$Weight)

print(paste('### Mean Accuracy of SVM Model is: ', 100 - mean(LM_APE)))

print(paste('### Median Accuracy of SVM Model is: ', 100 - median(LM_APE)))

Q. How to create an SVM model for Classification in R?

###### Support Vector Machines(SVM) Classification in R #######
# Creating the model on TRAINING data
library(e1071)
Model_SVM=svm(APPROVE_LOAN~CIBIL+AGE+SALARY, data=DataForMLTrain)

# Checking Accuracy of model on Testing data
DataForMLTest$Prediction=predict(Model_SVM, DataForMLTest)
head(DataForMLTest)

# Creating the Confusion Matrix to calculate overall accuracy, precision and recall on TESTING data
library(caret)
AccuracyResults=confusionMatrix(DataForMLTest$Prediction, DataForMLTest$APPROVE_LOAN, mode = "prec_recall")

# Since AccuracyResults is a list of multiple items, fetching useful components only
AccuracyResults[['table']]
AccuracyResults[['byClass']]

print(paste('### Overall Accuracy of SVM Model is: ', round(100 * AccuracyResults[['overall']][1]) , '%'))

###### Support Vector Machines(SVM) Classification in R #######

# Creating the model on TRAINING data

library(e1071)

Model_SVM=svm(APPROVE_LOAN~CIBIL+AGE+SALARY, data=DataForMLTrain)

# Checking Accuracy of model on Testing data

DataForMLTest$Prediction=predict(Model_SVM, DataForMLTest)

head(DataForMLTest)

# Creating the Confusion Matrix to calculate overall accuracy, precision and recall on TESTING data

library(caret)

AccuracyResults=confusionMatrix(DataForMLTest$Prediction, DataForMLTest$APPROVE_LOAN, mode = "prec_recall")

# Since AccuracyResults is a list of multiple items, fetching useful components only

AccuracyResults[['table']]

AccuracyResults[['byClass']]

print(paste('### Overall Accuracy of SVM Model is: ', round(100 * AccuracyResults[['overall']][1]) , '%'))

Q. How to create an SVM model for Regression in Python?

###### Support Vector Machines(SVM) Regression in Python #######
import pandas as pd
from sklearn import svm
RegModel = svm.SVR(C=2, kernel='linear')

#Printing all the parameters of KNN
print(RegModel)

#Creating the model on Training Data
SVM=RegModel.fit(X_train,y_train)
prediction=SVM.predict(X_test)

#Measuring Goodness of fit in Training data
from sklearn import metrics
print('R2 Value:',metrics.r2_score(y_train, SVM.predict(X_train)))

#Measuring accuracy on Testing Data
print('Accuracy',100- (np.mean(np.abs((y_test - prediction) / y_test)) * 100))

#Plotting the feature importance for Top 10 most important columns
#The built in attribute SVM.coef_ works only for linear kernel
%matplotlib inline
feature_importances = pd.Series(SVM.coef_[0], index=Predictors)
feature_importances.nlargest(10).plot(kind='barh')

#Printing some sample values of prediction
TestingDataResults=pd.DataFrame(data=X_test, columns=Predictors)
TestingDataResults[TargetVariable]=y_test
TestingDataResults[('Predicted'+TargetVariable)]=prediction
TestingDataResults.head()

###### Support Vector Machines(SVM) Regression in Python #######

import pandas as pd

from sklearn import svm

RegModel = svm.SVR(C=2, kernel='linear')

#Printing all the parameters of KNN

print(RegModel)

#Creating the model on Training Data

SVM=RegModel.fit(X_train,y_train)

prediction=SVM.predict(X_test)

#Measuring Goodness of fit in Training data

from sklearn import metrics

print('R2 Value:',metrics.r2_score(y_train, SVM.predict(X_train)))

#Measuring accuracy on Testing Data

print('Accuracy',100- (np.mean(np.abs((y_test - prediction) / y_test)) * 100))

#Plotting the feature importance for Top 10 most important columns

#The built in attribute SVM.coef_ works only for linear kernel

%matplotlib inline

feature_importances = pd.Series(SVM.coef_[0], index=Predictors)

feature_importances.nlargest(10).plot(kind='barh')

#Printing some sample values of prediction

TestingDataResults=pd.DataFrame(data=X_test, columns=Predictors)

TestingDataResults[TargetVariable]=y_test

TestingDataResults[('Predicted'+TargetVariable)]=prediction

TestingDataResults.head()

Q. How to create an SVM model for Classification in Python?

###### Support Vector Machines(SVM) Classification in Python #######
import pandas as pd
from sklearn import svm
clf = svm.SVC(C=2, kernel='linear')

#Printing all the parameters of KNN
print(clf)

#Creating the model on Training Data
SVM=clf.fit(X_train,y_train)
prediction=SVM.predict(X_test)

#Measuring accuracy on Testing Data
from sklearn import metrics
print(metrics.classification_report(y_test, prediction))
print(metrics.confusion_matrix(y_test, prediction))

#Plotting the feature importance for Top 10 most important columns
#The built in attribute SVM.coef_ works only for linear kernel
%matplotlib inline
feature_importances = pd.Series(SVM.coef_[0], index=Predictors)
feature_importances.nlargest(10).plot(kind='barh')

#Printing some sample values of prediction
TestingDataResults=pd.DataFrame(data=X_test, columns=Predictors)
TestingDataResults['TargetColumn']=y_test
TestingDataResults['Prediction']=prediction
TestingDataResults.head()

###### Support Vector Machines(SVM) Classification in Python #######

import pandas as pd

from sklearn import svm

clf = svm.SVC(C=2, kernel='linear')

#Printing all the parameters of KNN

print(clf)

#Creating the model on Training Data

SVM=clf.fit(X_train,y_train)

prediction=SVM.predict(X_test)

#Measuring accuracy on Testing Data

from sklearn import metrics

print(metrics.classification_report(y_test, prediction))

print(metrics.confusion_matrix(y_test, prediction))

#Plotting the feature importance for Top 10 most important columns

#The built in attribute SVM.coef_ works only for linear kernel

%matplotlib inline

feature_importances = pd.Series(SVM.coef_[0], index=Predictors)

feature_importances.nlargest(10).plot(kind='barh')

#Printing some sample values of prediction

TestingDataResults=pd.DataFrame(data=X_test, columns=Predictors)

TestingDataResults['TargetColumn']=y_test

TestingDataResults['Prediction']=prediction

TestingDataResults.head()

Conclusion

This post covers the major supervised machine learning algorithms. With these algorithms under your toolkit, you can easily solve any supervised machine learning algorithm with good accuracy.

Learning never ends though! There are more supervised ML algorithms which you should explore further. I recommend the book Introduction to Statistical Learning(ISLR) to take a deeper dive into the above listed and many more algorithms.

In the next post, I will answer the interview questions from Unsupervised Machine Learning section.

All the best for that Interview!

Author Details

Farukh Hashmi

Lead Data Scientist

Farukh is an innovator in solving industry problems using Artificial intelligence. His expertise is backed with 10 years of industry experience. Being a senior data scientist he is responsible for designing the AI/ML solution to provide maximum gains for the clients. As a thought leader, his focus is on solving the key business problems of the CPG Industry. He has worked across different domains like Telecom, Insurance, and Logistics. He has worked with global tech leaders including Infosys, IBM, and Persistent systems. His passion to teach inspired him to create this website!

https://thinkingneuron.com/

thinkingneuron@gmail.com

5 thoughts on “Data Science Interview Questions for IT Industry Part-3: Supervised ML”

Dr. Shailendra Badwaik
March 28, 2020 at 5:23 am

Indeed G8! web based tutorial , you made the complex critical knowledge simple. The examples and its illustration are powerful to understand the things in depth.

1. Farukh Hashmi
  April 24, 2020 at 1:26 am
  
  Thank you Dr. Shailendra! Means a lot when it is coming from you.
  
Biswajit Roy
April 19, 2020 at 6:42 am

Great Sir , This has been a good platform for knowledge gathering and building a Strong Foundation on Data Science .Blogs will be like Bible for emerging students.

1. Farukh Hashmi
  April 24, 2020 at 1:25 am
  
  Thank you Biswajit! I am glad you found it useful!
  
BAZ
June 3, 2025 at 10:10 am

I can’t thank you enough

Table of Contents

Overview

Sample Datasets

Create sample datasets in R

Create sample datasets in Python

Q. Explain how the Linear Regression algorithm works?

Q. How to do linear regression in R?

Q. How to do linear regression in Python?

Q. Explain how Logistic Regression algorithm works?

Q. How to do Logistic Regression in R?

Q. How to do Logistic Regression in Python?

Q. Explain how the Decision Tree algorithm works?

Q. How to create a Decision Tree for Regression in R?

Q. How to create a Decision Tree for Classification in R?

Q. How to create a Decision Tree for Regression in Python?

Q. How to create a Decision Tree for Classification in Python?

Q. Explain how the Random Forest algorithm works?

Q. How to create a Random Forest for Regression in R?

Q. How to create a Random Forest for Classification in R?

Q. How to create a Random Forest for Regression in Python?

Q. How to create a Random Forest for Classification in Python?

Q. Explain how the Adaboost algorithm works?

Q. How to create Adaboost for Regression in R?

Q. How to create Adaboost for Classification in R?

Q. How to create Adaboost for Regression in Python?

Q. How to create Adaboost for Classification in Python?

Q. Explain how the XGBoost algorithm works?

Q. How to create XGBoost for Classification in R?

Q. How to create XGBoost for Regression in Python?

Q. How to create XGBoost for Classification in Python?

Q Explain how the KNN algorithm works?

Q. How to create a KNN model for Regression in R?

Q. How to create a KNN model for Classification in R?

Q. How to create a KNN model for Regression in Python?

Q. How to create a KNN model for Classification in Python?

Q Explain how the SVM algorithm works?

Concept of a Separating Hyperplane:

Maximum Margin classifier

Support Vector Classifier

Support Vector Machines

Q. How to create an SVM model for Regression in R?

Q. How to create an SVM model for Classification in R?

Q. How to create an SVM model for Regression in Python?

Q. How to create an SVM model for Classification in Python?

Conclusion

5 thoughts on “Data Science Interview Questions for IT Industry Part-3: Supervised ML”

Leave a Reply! Cancel Reply