# Data Science Interview Questions for IT Industry Part-3: Supervised ML

In the previous post, I discussed the major types of machine learning. In this post, I will discuss the popular supervised machine learning algorithms that are asked about in data science interviews.

I will also share the code to implement each one of them in R/Python.

### Overview

In most IT projects, you will be able to find a use case for supervised machine learning, because there is always a need to predict a future number such as sales, turnover, the number of support tickets, or the demand for a product. This is when you will perform Regression.

Similarly, there will always be a need to classify records, e.g. should a home loan be issued or not? What is the tier of a customer (Silver, Gold, Platinum)? Is this insurance claim fraudulent or not? This is when you will perform Classification.

Supervised Machine Learning algorithms can be grouped into two categories (Regression and Classification) listed below. There are many statistical algorithms which are used for Supervised Machine Learning. I am listing some of the popular ones here which can be used to solve any given Regression or Classification problem with good accuracy.

Regression (predicting a continuous number): Linear Regression, Decision Trees, Random Forests, XGboost, Adaboost.

Classification (predicting a class): Logistic Regression, Decision Trees, Random Forests, XGboost, Adaboost, KNN, SVM.

As you can see some of the algorithms are suitable for regression as well as classification. The final choice is made by measuring the accuracy of the model.

When you are asked about any of these algorithms, try to use the flow below while describing it.

• What is the core concept of the algorithm (overview)?
• What is the mathematical logic behind the algorithm?
• In which language (R/Python) have you implemented this algorithm?
• Which hyperparameters did you tune, and what is the effect of each?
• How do you measure its accuracy?
• Which types of use cases fit this algorithm, and why was it selected for your project?

### Sample Datasets

I have used two simple datasets for regression and classification algorithms. I have written the code below to generate these in R and Python.

1. Regression: This small dataset contains the weight of random people who spend time at the gym, along with the number of hours they spend there and the calories they consume in a day.
Target variable: Weight
Predictors: Hours, Calories

2. Classification: This small dataset contains the loan approval history for various applicants based on their AGE, SALARY and CIBIL score.
Target variable: APPROVE_LOAN
Predictors: CIBIL, AGE, SALARY

#### Create sample datasets in Python
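The original Python listing did not survive here, so below is a minimal sketch of how the two datasets could be created with pandas. The column names follow the post; the actual values are made up purely for illustration.

```python
import pandas as pd

# Regression dataset: weight of gym-goers, with hours spent at the gym
# and daily calorie intake as predictors. Values are illustrative only.
reg_data = pd.DataFrame({
    "Hours":    [1.0, 1.5, 2.0, 1.0, 2.5, 0.5, 3.0, 2.0, 1.5, 0.5],
    "Calories": [2200, 2100, 1900, 2500, 1800, 2700, 1700, 2000, 2300, 2600],
    "Weight":   [75, 72, 68, 82, 66, 88, 64, 70, 78, 85],
})

# Classification dataset: historical loan approvals. APPROVE_LOAN is the
# target; CIBIL, AGE and SALARY are the predictors. Values are illustrative.
clf_data = pd.DataFrame({
    "CIBIL":  [480, 520, 560, 600, 650, 700, 540, 580, 620, 500],
    "AGE":    [25, 32, 41, 29, 35, 50, 45, 38, 27, 31],
    "SALARY": [30000, 45000, 50000, 60000, 80000, 90000, 40000, 55000, 65000, 35000],
    "APPROVE_LOAN": ["No", "No", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "No"],
})

print(reg_data.shape, clf_data.shape)
```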

Let’s get started with the questions!

### Q. Explain how the Linear Regression algorithm works?

Linear regression algorithm is based on the equation of a line (y= m * X + C).

The assumption is that the variables are related in a linear way.

In Machine Learning, the target variable is ‘y’ and the predictor is ‘X’.

The linear regression algorithm helps to find the best values of the coefficients ‘m’ and ‘C’ for the given data using a cost function. One such cost function is the Sum of Squared Errors (SSE). There are many others, like Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE); log loss (Cross-Entropy loss) is the analogous cost function for classification.

Example: Predict the weight of a person based on the number of hours spent at the gym and the amount of calories intake every day.

Target Variable: Weight

Predictors: Hours, Calories

Simple Linear Regression Equation: Weight = m* (Hours) + C

Multiple Linear Regression Equation: Weight = m1 * (Hours) + m2 * (Calories)+ C

For multiple linear regression, the algorithm finds the best values of the coefficients m1, m2, and C.

If there is one predictor, it is called simple linear regression; if there is more than one predictor, it is called multiple linear regression.

Normal linear regression(Ordinary Least Square Regression) does not have any hyperparameters to tune.

If you are using regularization, then either Ridge or Lasso is used.

Regularization means reducing the effect of a variable in the equation by shrinking its coefficient, which is done by adding a penalty term to the cost function. This penalty can be applied in two ways: the first is Ridge and the second is LASSO.

Ridge regression improves prediction error by shrinking large regression coefficients in order to reduce overfitting, but it does not perform variable selection and therefore does not help to make the model more interpretable. This is where LASSO comes into the picture.

Lasso(Least Absolute Shrinkage and Selection Operator) is able to improve the prediction accuracy and perform the variable selection by forcing certain coefficients(m1, m2, etc) to be set to zero, hence, choosing a simpler model that does not include those predictors.

Accuracy measurement is done using Mean Absolute Percent Error (MAPE) or Root Mean Squared Error (RMSE).

Accuracy = 100 − MAPE. (Note that RMSE is measured in the units of the target variable, so it is reported directly rather than converted into a percentage.)
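As a sketch, here is how the multiple regression above could be fit and scored with scikit-learn; the gym-style data values are made up for illustration, and MAPE is computed by hand.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative gym data: predict Weight from Hours and Calories.
X = np.array([[1.0, 2200], [1.5, 2100], [2.0, 1900], [2.5, 1800],
              [0.5, 2700], [3.0, 1700], [2.0, 2000], [1.0, 2500]])
y = np.array([75, 72, 68, 66, 88, 64, 70, 82])

# Finds the best m1, m2 and C by ordinary least squares
model = LinearRegression().fit(X, y)
pred = model.predict(X)

# Mean Absolute Percent Error, and the "accuracy" derived from it
mape = np.mean(np.abs((y - pred) / y)) * 100
print("Coefficients:", model.coef_, "Intercept:", model.intercept_)
print("Accuracy: %.1f%%" % (100 - mape))
```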

### Q. Explain how Logistic Regression algorithm works?

Logistic Regression is used for predicting a category, especially binary categories (Yes/No, 0/1).

For example, whether to approve a loan or not (Yes/No)? Which group does this customer belong to (Silver/Gold/Platinum)? etc.

When there are only two outcomes in Target Variable it is known as Binomial Logistic Regression.

If there are more than two outcomes in Target Variable it is known as Multinomial Logistic Regression.

If the outcomes in Target Variable are ordinal and there is a natural ordering in the values (eg. Small<Medium<Large) then it is known as Ordinal Logistic Regression.

Logistic regression is based on the logit function: logit(p) = log(p / (1 − p)). The model fits a linear equation β0 + β1X1 + β2X2 + … to the log-odds; inverting the logit gives the probability P = 1 / (1 + e^−(β0 + β1X1 + β2X2 + …)).

The output is a value between 0 and 1. It is the probability of an event’s occurrence.

E.g. There is an 80% chance that the loan application is good, approve it.

The coefficients β0, β1, β2, β3… are found using Maximum Likelihood Estimation Technique. Basically, if the Target Variable’s value (y) is 1, then the probability of one “P(1)” should be as close to 1 as possible and the probability of zero “P(0)” should be as close to 0 as possible. Find those coefficients which satisfy both the conditions.

The Goodness of Fit (AIC and Deviance):

These measures help to understand how good the model fits the data. Please note This is NOT the Accuracy of the model.

The Akaike Information Criterion (AIC) provides a method for assessing the quality of your model through comparison of related models. The number itself is not meaningful. If you have more than one model, you should select the model that has the smallest AIC.

The null deviance shows how well the target variable is predicted by a model that includes only the intercept (grand mean).

The Residual deviance shows how well the target variable is predicted by a model that includes multiple independent predictors

Accuracy measurement is done using F1-score, Precision, Recall, or ROC/AUC.
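A minimal scikit-learn sketch of binomial logistic regression on loan-style data (the values are illustrative assumptions, not real applicant records):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up loan data: 1 = approved, 0 = rejected.
X = np.array([[480, 25, 30000], [500, 31, 55000], [520, 32, 45000],
              [540, 45, 40000], [560, 41, 50000], [600, 29, 60000],
              [650, 35, 80000], [700, 50, 90000]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Coefficients are found by maximum likelihood estimation under the hood
model = LogisticRegression(max_iter=1000).fit(X, y)

# predict_proba returns [P(0), P(1)]; approve if P(1) is high enough
new_applicant = np.array([[640, 30, 70000]])
p_approve = model.predict_proba(new_applicant)[0, 1]
print("P(approve) = %.2f" % p_approve)
```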

### Q. Explain how the Decision Tree algorithm works?

Decision Trees are suitable for both Regression as well as Classification use cases.

• Decision trees select the best predictor out of all available predictors by measuring its efficiency using Entropy, Information Gain, or the Gini index.
• The basic idea is to choose the predictor that slices the data into two parts such that each part contains similar values of the target variable.
• This activity creates a “root node”, basically the first ‘if-statement’ that is checked to make decisions.
• Keep repeating the same activity on both slices of the data until no further splits can be made.

Consider below example where Loan approval historical data is provided.
Target Variable: APPROVE_LOAN
Predictors: CIBIL, AGE, SALARY

Now, we need to find a “rule” or set of rules using this data which can help to make the decision about a new loan application.

The Decision Tree algorithm will try to find the best splitter of data.

Basically, out of AGE, SALARY and CIBIL score of a loan applicant, which predictor “segregates” all the “Yes” cases together and all the “No” cases together?

Since this is simple data, we can see with basic data analysis that “All loans are approved if the CIBIL score is more than 550”.

Hence, by learning from historical data, we can formulate a rule. IF CIBIL>550 THEN Approve a loan ELSE reject it.

If we pass this data to the Decision Tree algorithm, it finds out this rule by itself and a decision node (IF-ELSE statement) is created as shown below.

If the data is not simple, then the same procedure is repeated until no further splits can be made.
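The rule-learning described above can be sketched with scikit-learn's DecisionTreeClassifier. The data below is made up so that approval depends only on CIBIL > 550, and export_text prints the learned IF-ELSE rule:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up loan data where approval depends only on CIBIL > 550.
X = np.array([[480, 25, 30000], [500, 31, 55000], [520, 32, 45000],
              [540, 45, 40000], [560, 41, 50000], [600, 29, 60000],
              [650, 35, 80000], [700, 50, 90000]])
y = ["No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes"]

tree = DecisionTreeClassifier(criterion="gini").fit(X, y)

# export_text prints the learned IF-ELSE rule; with this data the root
# split lands on the CIBIL column at the midpoint threshold of 550
rules = export_text(tree, feature_names=["CIBIL", "AGE", "SALARY"])
print(rules)
```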

### Q. Explain how the Random Forest algorithm works?

Random Forests are suitable for both Regression as well as Classification use cases.

Random forests are basically multiple decision trees put together. Combining many models built on random samples of the data like this is also known as bagging.

To create a Random Forest predictive model, the below steps are followed.

1. Take some random rows from the data, let’s say 80% of all the rows randomly selected.
Hence every time selecting some different set of rows.
2. Take some random columns from the above data, let’s say 50% of all the columns randomly selected.
Hence every time selecting some different set of columns.
3. Create a decision tree using the above data.
4. Repeat steps 1 to 3 for n number of times (Number of trees). n could be any number like 10 or 50 or 500 etc. (This is known as bagging)
5. Combine the predictions from each of the trees to get a final answer.
In the case of Regression, the final answer is the average of predictions made by all trees.
In the case of Classification, the final answer is the mode(majority vote) of predictions made by all trees.

These steps ensure that every possible predictor gets a fair chance in the process, because we limit the columns used for each tree.
Also, the model has low variance, because each tree is built on a different random set of rows, so the errors of the individual trees average out.

This above procedure helps to get good accuracy out of the box!

Consider below dataset with 20 records. The target variable is APPROVE_LOAN and predictors are CIBIL, AGE, and SALARY.

The first step is to create smaller samples of the above data by choosing a few random rows and predictor columns, while always keeping the target variable in the sample. The number of predictor columns chosen can be controlled using the parameter “max_features”; by default it is the square root of the total number of predictor columns. APPROVE_LOAN is the target variable, hence it is always present in every sample.

The second step is to create independent decision trees on each of the smaller datasets created above and combine the predictions from all of them. In this example, the number of trees (parameter n_estimators) is equal to 3.

The final prediction is the average of all predictions in case of regression and the majority vote in case of classification.
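The two steps above can be sketched with scikit-learn's RandomForestClassifier; the loan-style values are made up for illustration, and the parameter names n_estimators and max_features match the ones discussed above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Made-up loan data: approval depends on the CIBIL score.
X = np.array([[480, 25, 30000], [500, 31, 55000], [520, 32, 45000],
              [540, 45, 40000], [560, 41, 50000], [600, 29, 60000],
              [650, 35, 80000], [700, 50, 90000]])
y = ["No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes"]

forest = RandomForestClassifier(
    n_estimators=3,        # three trees, as in the walkthrough above
    max_features="sqrt",   # random column subset considered per split
    bootstrap=True,        # random row sampling with replacement
    random_state=42,
).fit(X, y)

# The final answer is the majority vote across the three trees
prediction = forest.predict([[630, 33, 70000]])
print(prediction)
```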

### Q. Explain how the Adaboost algorithm works?

It is an algorithm which ensembles (combines) multiple simple predictive models, also known as weak learners, to generate a final strong model.

A decision tree with one level, also known as a decision stump, is the most popular weak learner algorithm used in AdaBoost.

These are called stumps because they are so simple that when you plot them you just get a single line (a stump!).

Look at the image, here D1 is a decision stump.

Anything which is on the left side of D1 will be classified as positive (+) and anything which is on the right side will be classified as negative (-)

You can observe that it is not very efficient, because there are some positive values on the right side as well.

This is why multiple stumps are combined to create a final model which does a better job of classification

Adaboost is suitable for binary classification problems, however, you can use it for multi-class classification or regression as well.

The basic idea behind Adaboost is Boosting. It works with the below-listed steps:

• Create a first predictive model on original data
• Create a second predictive model which corrects the mistakes of the first model
• Create a third predictive model which corrects the mistakes of the second model
• Keep creating more models until 100% accuracy is achieved OR the max number of iterations(number of trees) is reached

Adaboost takes this general concept of boosting and adds a twist to it by assigning weights to data and models. The below-listed steps describe the working of Adaboost.

1. Take the original training data and assign a weight of 1/n to all rows. (n=total number of rows in Training data)
2. Randomly select a subset of rows from the original data and create a predictive model(decision stump)
3. Generate predictions on the above subset using the above predictive model
4. Update the weights in the original data. Give higher weights to those rows where the prediction was incorrect, and lower weights to those rows where the prediction was correct. Similarly, give the current model a higher weight if it is accurate; otherwise, give it a lower weight.
5. Select the rows from original data based on weights. Those with higher weights get selected first.
6. These rows signify the mistakes by the previous predictive model.
7. Create a predictive model on the above data.
8. Repeat steps 3-7 until 100% accuracy is achieved OR the max number of iterations (number of trees) is reached.
9. The final prediction is the weighted average of all predictions made by different models created above.

Below flowchart visualizes the above-listed steps for Adaboost.
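The steps above can be sketched with scikit-learn's AdaBoostClassifier, whose default weak learner is a depth-1 decision tree, i.e. a decision stump; the loan-style data is made up for illustration.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

# Made-up loan data: approval depends on the CIBIL score.
X = np.array([[480, 25, 30000], [500, 31, 55000], [520, 32, 45000],
              [540, 45, 40000], [560, 41, 50000], [600, 29, 60000],
              [650, 35, 80000], [700, 50, 90000]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # 0 = reject, 1 = approve

# The default weak learner is a decision stump; n_estimators caps the
# number of boosting rounds (trees)
ada = AdaBoostClassifier(n_estimators=25, random_state=0).fit(X, y)

print("training accuracy:", ada.score(X, y))
print("prediction for CIBIL=630:", ada.predict([[630, 33, 70000]])[0])
```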

#### Q. How to create Adaboost for Regression in R?

There is no stable open-source version of Adaboost implementation for Regression in R language as of now.

### Q. Explain how the XGBoost algorithm works?

XGboost is the short form for eXtreme Gradient Boosting.

It is one of the most popular algorithms in the machine learning universe. It is used by winners at most of the hackathons!
The reasons for its popularity are listed below.

• Handles large datasets with ease
• Trains very fast
• High accuracy for most of the datasets

The basic logic behind XGboost is Boosting.

Let us quickly revise how boosting works in general.

• Create a Predictive model M1.
• Create another Predictive model M2 which corrects the mistakes of the previous model M1. Hence M2 is the boosted version of M1.
• Create another Predictive model M3 which corrects the mistakes of the previous model M2.
• Keep repeating the process until the model is 100% accurate OR the maximum number of allowed iterations (number of trees) is reached.

The key differentiator of XGBoost is how exactly it corrects the mistakes of the previous model. For the previous algorithm, Adaboost, I discussed how it gives higher weights to the incorrectly predicted rows so that the next model focuses on reducing those mistakes.

In XGboost, the approach is focused on reducing the error gradient (the mistakes of the previous model).

XGboost works with below-listed steps.

• Create a predictive model M1 on full Training data with the target variable “y”. (Notice that there is no random sampling here. XGBoost uses full training data.)
• Find the difference between the values of the original target variable “y” and predicted target variable “y1” as “e1 = y – y1”
• Create another predictive model M2 with the target variable as e1 which is “y – y1”. i.e. the gradient of the mistakes by previous model M1.
• Find the difference between the values of the target variable “y-y1” and predicted target variable “y2” as “e2 = y – y1 -y2”
• Create another predictive model M3 with the target variable as e2 which is “y – y1 -y2”. i.e. the gradient of the mistakes by previous model M2.
• Keep repeating these steps until the error becomes zero OR the maximum number of iterations is reached.
• The final model will exhibit the least amount of error and hence it will provide the maximum accuracy by learning from the mistakes of all previous models.
• The final prediction will be the sum of the predictions from all models. y_final = y1 + y2 + y3 …

This may seem confusing at first, but if you observe closely, the biggest value is y1 and all the other values are adjustments to y1, since they are predictions of the errors and not of the original target variable “y”.

Hence, the sum of all makes it more accurate since every next model tries to minimize the errors made by the previous model.

This is why XGboost is also known as an additive model since you keep adding weak learner models on top of the first one and in the end, you get a strong predictive model.
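The additive, residual-fitting logic above can be sketched by hand with plain decision trees. This is a toy illustration of the gradient boosting idea, not the actual XGBoost library, and the data is synthetic:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data
X = np.linspace(0, 10, 40).reshape(-1, 1)
y = 2 * X.ravel() + 3 * np.sin(X.ravel())

# Hand-rolled boosting: each new tree fits the residual (the error
# "gradient" under squared loss) left over by the sum of previous trees.
trees, pred = [], np.zeros_like(y)
for _ in range(5):
    residual = y - pred                       # e = y - (y1 + y2 + ...)
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    trees.append(tree)
    pred += tree.predict(X)                   # additive model

# y_final = y1 + y2 + ... : sum the trees' outputs for a new point
y_new = sum(t.predict([[4.2]])[0] for t in trees)
rmse = np.sqrt(np.mean((y - pred) ** 2))
print("training RMSE after 5 rounds: %.3f" % rmse)
```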

### Q Explain how the KNN algorithm works?

KNN stands for K Nearest Neighbours. As the name suggests this algorithm tries to classify a new case based on K nearby points.

A practical range for K is 2 to 10. If K=3, it means KNN will look for the 3 nearest points.
It consists of simple steps for any new point:

• Find the K most similar (closest) points.
• Find the count of each class in those K points.
• Classify the new point as that class which is present the maximum number of times in these K points.
• In the case of regression, take the average of nearest “K” points.

As you can see in the figure. The cross is the new point and we choose K=3.
KNN will look at the nearby 3 points. It can be seen that 2 points are blue out of 3. Hence the new point is assigned to the blue class.

The next question is, how does KNN find the nearest neighbour?
It does so by measuring distances between the points.

Distance between two points can be calculated using any one of the below methods.

• Euclidean Distance: Take the differences between the coordinates of the points, square them, and add them up; the distance is the square root of that sum.
E.g. Two points A(2,3) and B(6,5) have a sum of squared differences of (2-6)² + (3-5)² = 20, so their Euclidean distance is √20 ≈ 4.47.

• Manhattan Distance: The sum of absolute differences between the coordinates of points.
E.g. Two points A(2,3) and B(6,5) will have a Manhattan distance of |2-6| + |3-5| = 6

• Minkowski Distance: It is the generalization of both the Euclidean and Manhattan distances: D(A, B) = (Σ |ai − bi|^q)^(1/q). A parameter “q” drives the behaviour of the formula.
If q = 1, Manhattan Distance
If q = 2, Euclidean Distance
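The three distances can be checked with a few lines of Python; the helper function below is an illustrative sketch built on the Minkowski formula.

```python
import numpy as np

def minkowski(a, b, q):
    """Generalized distance: q=1 gives Manhattan, q=2 gives Euclidean."""
    return np.sum(np.abs(np.asarray(a) - np.asarray(b)) ** q) ** (1 / q)

A, B = (2, 3), (6, 5)
print(minkowski(A, B, 1))   # Manhattan: |2-6| + |3-5| = 6
print(minkowski(A, B, 2))   # Euclidean: sqrt((2-6)^2 + (3-5)^2) = sqrt(20)
```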

### Q Explain how the SVM algorithm works?

The support vector machine (SVM) is a generalization of a simple classifier called the maximal margin classifier.

SVM is preferred for classification; however, you can use it for regression as well.
Before you jump into SVM, its building blocks need to be understood. They are listed below:

• Hyperplane
• Maximum Margin Classifier
• Support Vector Classifier
• Support Vector Machines

The Maximum Margin Classifier can be used only when the classes are separated by a linear boundary. For example, out of the three lines possible to separate the red and blue regions, only one is the best line.

These lines are known as hyperplanes. Only one of the hyperplanes is the best one.

#### Concept of a Separating Hyperplane:

If there are “N” number of dimensions then, a hyperplane defined inside it has “N-1” dimensions.

For example, if we have 2-Dimensions then the hyperplane will be a straight line drawn in it.
Look at the diagram below, The equation of the line drawn is 1 + 2X1 + 3X2 = 0. The two dimensions are X1(X-Axis) and X2(Y-Axis).

Now, if you choose some values of X1 and X2 such that 1+2X1 +3X2 > 0, then you will get the Blue region. These points are “Above” the line.

Again, if you choose some values of X1 and X2 such that 1+2X1 +3X2 < 0, then you will get the Red region. These points are “Below” the line.

You can say this line(1 + 2X1 + 3X2 = 0) is “Separating” the Red Points from the Blue Points.
Hence, this line is also known as a “Separating Hyperplane”

This Separating hyperplane also acts as a classifier. Let’s say a new point (X1n, X2n) needs to be assigned a class (Red or Blue)

If 1 + 2X1n + 3X2n > 0 then it belongs to the Blue region/Blue class, and if 1 + 2X1n + 3X2n < 0 then it belongs to the Red region/Red class.

For example, consider the point (X1=1, X2=1.2)
Substituting the values in the equation 1 + 2X1 + 3X2, we get 1 + 2*1 + 3*1.2 = 6.6.
Now, because 6.6 > 0, the point (X1=1, X2=1.2) belongs to the Blue class / Blue region.
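The separating-hyperplane check above is just a sign test, which can be written as a tiny function (the class names and the second test point are illustrative):

```python
def classify(x1, x2):
    """Use the separating hyperplane 1 + 2*X1 + 3*X2 = 0 as a classifier."""
    score = 1 + 2 * x1 + 3 * x2
    return "Blue" if score > 0 else "Red"

print(classify(1, 1.2))    # 1 + 2*1 + 3*1.2 = 6.6 > 0, so Blue
print(classify(-2, -1))    # 1 - 4 - 3 = -6 < 0, so Red
```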

#### Maximum Margin classifier

Find a separating hyperplane in the N-dimensional space while maximizing the perpendicular distances of the hyperplane from the points in the training data.

The smallest of all these perpendicular distances is known as the “margin”.

The maximum margin classifier is the separating hyperplane that has the largest margin.
Basically, it has the largest distance from the closest points of all the classes.

Look at the diagram below of a 2-dimensional space with red and blue points. The maximum margin classifier is the one with the maximum margin, i.e. the largest distance from the closest points of both classes. The Maximum Margin Classifier is the thick black line, and the points on the dotted lines are the support vectors.

#### Support Vector Classifier

Sometimes there is no “Separating Hyperplane” which clearly separates the data.
This is because the data points are jumbled up and there is no clear boundary between the classes.

Take a look below where the red and blue points are mixed up.
In this case, a generalized version of Maximum margin classifier is used, which is also known as support vector classifier.

The maximum margin classifier tries to classify each and every point perfectly and while doing so it can overfit and may not remain generic.
This is where the concept of “Soft margin” is applied.

“Soft margin” means the classifier is allowing a few wrong classifications in order to make the model generic.

The idea behind the support vector classifier is to use a “soft margin”: the Support Vector Classifier allows some errors, while the Maximum Margin Classifier tries to classify every point perfectly.

The amount of softness is controlled by a non-negative tuning parameter called “C”. If the value of C is high, the hyperplane is allowed more misclassifications. If the value of C is small, only a few misclassifications are allowed.

If the value of C=0 then it does not allow any misclassifications and hence becomes the maximum margin classifier.
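As a sketch in scikit-learn: note that scikit-learn's C parameter works in the opposite direction to the "budget" C described above, i.e. a large sklearn C punishes misclassifications heavily (harder margin) while a small sklearn C gives a softer margin. The overlapping two-class data is randomly generated for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Two overlapping classes, so no perfect linear separation exists.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(1.5, 1, (30, 2))])
y = np.array([0] * 30 + [1] * 30)

# scikit-learn's C is the inverse of the "budget" C: small C here means
# a softer margin (more violations tolerated), large C a harder margin.
for c in (0.01, 100):
    clf = SVC(kernel="linear", C=c).fit(X, y)
    print("C=%g  training accuracy=%.2f" % (c, clf.score(X, y)))
```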

#### Support Vector Machines

So far the classes were separated by a linear boundary. What if the separating boundary is non-linear?

In such cases, we should try to find an equation which is non-linear, to separate the classes.

This equation is found using “kernels”, and the resulting classifier is known as a support vector machine. The above diagram shows an SVM with a radial kernel; there could be linear kernels as well.

Hence, you can say that support vector machines are the extension of the support vector classifier where the original data is transformed to a higher degree (e.g. linear to quadratic or cubic) so that it becomes linearly separable in that higher dimension.
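The linear-vs-radial kernel difference can be sketched on a made-up "ring around a blob" dataset, which no straight line can separate:

```python
import numpy as np
from sklearn.svm import SVC

# A ring of one class around a blob of the other: no straight line can
# separate them, but an RBF (radial) kernel handles the round boundary.
rng = np.random.default_rng(1)
angles = rng.uniform(0, 2 * np.pi, 60)
inner = rng.normal(0, 0.3, (60, 2))                # class 0: central blob
outer = np.c_[np.cos(angles), np.sin(angles)] * 2  # class 1: ring, radius 2
X = np.vstack([inner, outer])
y = np.array([0] * 60 + [1] * 60)

linear = SVC(kernel="linear").fit(X, y)
radial = SVC(kernel="rbf").fit(X, y)
print("linear kernel accuracy:", linear.score(X, y))
print("rbf kernel accuracy:   ", radial.score(X, y))
```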

### Conclusion

This post covers the major supervised machine learning algorithms. With these algorithms in your toolkit, you can solve almost any supervised machine learning problem with good accuracy.

Learning never ends though! There are more supervised ML algorithms which you should explore further. I recommend the book Introduction to Statistical Learning(ISLR) to take a deeper dive into the above listed and many more algorithms.

In the next post, I will answer the interview questions from Unsupervised Machine Learning section.

All the best for that interview!

Farukh is an innovator in solving industry problems using Artificial Intelligence. His expertise is backed by 10 years of industry experience. As a senior data scientist, he is responsible for designing AI/ML solutions that provide maximum gains for clients. As a thought leader, his focus is on solving the key business problems of the CPG industry. He has worked across different domains like Telecom, Insurance, and Logistics, and with global tech leaders including Infosys, IBM, and Persistent Systems. His passion to teach got him to start this blog!
