Data Science Interview Questions for IT Industry Part-2: Machine Learning

In the previous post, I discussed the interview questions related to self-introduction and statistics. I recommend reading it to brush up on the statistical concepts required to understand machine learning.

In this post, I will list the important conceptual questions related to Machine Learning that are frequently asked in data science interviews.

What is Machine Learning in simple words?
How does machine learning work?
Which programming language is best for machine learning?
What are the different types of machine learning?
What are the different types of supervised machine learning?
What are the different types of unsupervised machine learning?
What is Reinforcement learning?
What are the differences between supervised and unsupervised learning?
Which machine learning algorithm have you used?
How do you know which machine learning algorithm should be used?
Can you explain how the algorithm you used in your project works?

Q. What is Machine Learning in simple words?

When you combine computer programming skills with Statistics, the result is a field of study known as Machine Learning.

In simple terms, Machine Learning is the implementation of statistical algorithms like linear regression, logistic regression, random forests, sampling, hypothesis testing, etc. using a programming language or tool (R, Python, SAS, SPSS, etc.) in order to gain insights from historical data and, in many cases, predict what is likely to happen in the future.

For example, you can predict next quarter's sales from the historical sales trend using a statistical algorithm like Linear Regression.

Another example: based on historical records of customer payment defaults, you can approve or reject a new loan application using a statistical algorithm like Logistic Regression.
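The sales example above can be sketched in a few lines of plain Python. This is only a minimal illustration of simple linear regression (ordinary least squares with one predictor); the quarterly sales figures and the function name fit_simple_linear_regression are made up for this sketch and are not from any real project.

```python
# Hypothetical quarterly sales figures (invented for illustration).
quarters = [1, 2, 3, 4, 5, 6, 7, 8]
sales = [100.0, 110.0, 120.0, 130.0, 140.0, 150.0, 160.0, 170.0]


def fit_simple_linear_regression(x, y):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel out.
    covariance = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    variance = sum((xi - mean_x) ** 2 for xi in x)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_simple_linear_regression(quarters, sales)

# Use the learned line to predict the next (unseen) quarter.
next_quarter = 9
forecast = slope * next_quarter + intercept
print(f"Forecast for quarter {next_quarter}: {forecast:.1f}")
```

In practice you would normally rely on a library implementation in R or Python rather than writing the formula by hand; the point here is only to show that the "learning" step boils down to estimating a slope and an intercept from historical data.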

Q. How does Machine Learning work?

How do you know ‘A’ stands for Apple?
How do you know the square root of 16 is 4?

You know these because someone ‘taught’ you these as concepts/facts by giving examples.

Machine Learning works in the same way. Statistical algorithms learn by looking at a lot of examples of the form: if ‘this’ is the input, then ‘that’ should be the output. In general, the more good-quality examples you provide to the algorithm, the more accurate the machine learning model becomes.

Figure: How Machine Learning Works

In the picture above, the Machine Learning algorithm observes examples from the history of loan approval cases. It learns when to say yes and when to say no based on the values of CIBIL SCORE and SALARY.

Figure: Prediction for a new case by the ML model

Once the algorithm has seen many examples, it becomes intelligent! Now, if you pass it a new case to analyze, the Machine Learning model will answer either Yes or No based on the cases it has observed.
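To make the "learn from examples, then answer for a new case" idea concrete, here is a minimal sketch in plain Python. It uses a k-nearest-neighbours vote, which is just one simple way to answer from observed cases and is not necessarily the algorithm behind the picture described above. The loan history, the column values, and the function names are all hypothetical.

```python
import math

# Hypothetical past loan cases: (CIBIL score, monthly salary, decision).
history = [
    (820, 90000, "Yes"),
    (780, 75000, "Yes"),
    (760, 60000, "Yes"),
    (800, 50000, "Yes"),
    (620, 30000, "No"),
    (580, 45000, "No"),
    (650, 25000, "No"),
    (600, 20000, "No"),
]


def scale(value, low, high):
    """Rescale a value to the 0-1 range so both inputs carry similar weight."""
    return (value - low) / (high - low)


def predict(history, cibil, salary, k=3):
    """Answer "Yes" or "No" by majority vote of the k most similar past cases."""
    cibil_values = [row[0] for row in history]
    salary_values = [row[1] for row in history]
    c_low, c_high = min(cibil_values), max(cibil_values)
    s_low, s_high = min(salary_values), max(salary_values)

    distances = []
    for past_cibil, past_salary, decision in history:
        # Euclidean distance between the new case and a past case (scaled inputs).
        d = math.hypot(
            scale(cibil, c_low, c_high) - scale(past_cibil, c_low, c_high),
            scale(salary, s_low, s_high) - scale(past_salary, s_low, s_high),
        )
        distances.append((d, decision))

    distances.sort(key=lambda pair: pair[0])
    nearest = [decision for _, decision in distances[:k]]
    return max(set(nearest), key=nearest.count)


print(predict(history, cibil=790, salary=70000))  # a strong applicant
print(predict(history, cibil=610, salary=28000))  # a weak applicant
```

With this made-up history, the first applicant sits among the approved cases and the second among the rejected ones, so the model answers accordingly. An odd value of k avoids tied votes when there are only two possible answers.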

Q. Which programming language is best for machine learning?

There is no correct answer to this question!

There are many programming languages in which you can perform machine learning. Popular ones are R, Python, and SAS. There are also some drag-and-drop tools for machine learning like SPSS, RapidMiner, AzureML, Google’s AutoML, etc.

Right now, the most popular choices for machine learning are arguably R and Python, since both languages have implementations of almost every statistical technique found in the books. Also, both of these are open source.

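Python is generally preferred over R when you need to perform Deep Learning (image detection, face recognition, chatbots, etc.), since major deep learning frameworks such as TensorFlow and PyTorch are built primarily around Python.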

Q. What are the different types of machine learning?

There are three major types of machine learning:

Supervised ML: Teach the algorithm by examples: if the input is X, then the output should be y (the Target Variable). Some popular algorithms for this are linear regression, logistic regression, decision trees, random forests, SVM, naive Bayes, XGBoost, AdaBoost, etc.
Unsupervised ML: No Target Variable is present. You DON’T have data which says the input is X and the output is y. In this scenario, important patterns are derived directly from the data, like grouping similar rows together (Clustering), without any prior knowledge of the given data. Some popular algorithms are K-Means, DBSCAN, PCA, ICA, Apriori.
Reinforcement ML: The Machine Learning algorithm learns from its mistakes and improves in the next iteration in order to achieve an objective. Some popular algorithms are Monte Carlo methods, Q-Learning, SARSA.

Q. What are the different types of supervised machine learning?

Regression and Classification are two types of supervised machine learning.

Regression: When you are predicting a continuous number like “Sales”, “Profit”, “Demand”, “Turnover”, “Volumes”, “Number of Visitors”, etc., it is Regression.
Classification: When you are predicting a categorical/discrete value like “0/1”, “Yes/No”, “Good/Bad”, “Silver/Gold/Platinum”, etc., it is Classification.

Choosing the appropriate type of supervised machine learning is done by looking at the target variable. If the target variable is continuous, it is regression; if the target variable is categorical, it is classification.

Regression and classification are the two big types of supervised machine learning used in the industry currently. Almost every business domain has one or more applications of these two techniques.

There are many algorithms used for supervised machine learning. I am listing some of the popular ones.

Regression: Linear Regression, Decision Trees, Random Forests, XGBoost, AdaBoost.

Classification: Logistic Regression, Decision Trees, Random Forests, XGBoost, KNN, SVM.

Q. What are the different types of Unsupervised machine learning?

There are three major types of Unsupervised Learning.

Clustering
Dimension Reduction
Association

As the name suggests, unsupervised means letting the algorithm find patterns in the data on its own. You are NOT supervising the learning process. You are not showing examples to the algorithm of the form: if this is the input X, then that is the output y.

Clustering: Creating groups of similar rows. The idea is to put all the similar data rows together in one cluster/group. Important clustering algorithms are K-Means, Hierarchical clustering, DBSCAN, OPTICS.
Dimension Reduction: Reduce the number of predictor columns in the data by combining similar ones into a single column. This process is known as dimension reduction because each column in your data represents one dimension. If the data is high dimensional (e.g. 500 columns), it can decrease the efficiency of predictive models in terms of speed and accuracy. Hence it would be great if we could shrink those 500 columns to some 5-10 columns representing the same patterns as all 500 columns. Some of the algorithms used for Dimension Reduction are PCA, ICA, t-SNE, UMAP.
Association: Finding out which products sell together is the top application of Association rule mining. Rules are generated by counting the items in transactions and computing the support, confidence and lift scores. In simple terms, if the user purchases item A, how likely are they to purchase item B? The algorithms used to find associations are Apriori, Eclat, FP-Growth.
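The support, confidence and lift scores mentioned under Association are simple counts and ratios. The sketch below computes them for a single rule in plain Python; the shopping baskets and the function names are hypothetical, and it does not implement the rule search that Apriori, Eclat or FP-Growth perform.

```python
# Hypothetical shopping baskets, one set of items per transaction.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
    {"milk", "jam"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def rule_metrics(antecedent, consequent, transactions):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    support_both = support(antecedent | consequent, transactions)
    # Confidence: of the baskets with the antecedent, how many also have the consequent?
    confidence = support_both / support(antecedent, transactions)
    # Lift: how much more likely the consequent is, compared to its base rate.
    lift = confidence / support(consequent, transactions)
    return support_both, confidence, lift


rule_support, rule_confidence, rule_lift = rule_metrics({"bread"}, {"butter"}, transactions)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f} lift={rule_lift:.3f}")
```

For these baskets, bread and butter appear together in 3 of 6 transactions (support 0.50), butter is present in 3 of the 4 bread baskets (confidence 0.75), and the lift is 0.75 / (4/6) = 1.125. A lift above 1 suggests that buying bread makes buying butter more likely than it is on average.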

Q. What is Reinforcement learning?

Reinforcement learning is the type of machine learning where the algorithms learn by making mistakes and improve in the next iteration.

The basic idea is to give a reward (add one point) if the algorithm takes a correct step, and similarly give a punishment (subtract one point) if it takes an incorrect step. Very much like teaching a child! However, here you don’t supervise the learning; you simply define the rewards and punishments and leave the algorithm to improve from the feedback it gets on its own.

Some of the algorithms used for reinforcement learning are:

Monte Carlo
Q-Learning
SARSA (State-Action-Reward-State-Action)
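The reward and punishment idea can be sketched with a tiny Q-Learning example. Everything about the environment here is an assumption made for illustration: a corridor of five positions where reaching the right end earns +1 and falling off the left end costs -1. The learning rate, discount factor and exploration rate are arbitrary choices, not recommended values.

```python
import random

N_STATES = 5             # positions 0..4 in a corridor
GOAL, PIT = 4, 0         # reaching 4 earns +1, reaching 0 costs -1
ACTIONS = (-1, +1)       # step left or step right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2  # learning rate, discount, exploration rate


def train(episodes=1000, seed=0):
    """Learn a table of Q-values (expected future reward) by trial and error."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state = 2  # always start in the middle of the corridor
        for _ in range(100):  # cap on steps per episode
            if rng.random() < EPSILON:
                action = rng.choice(ACTIONS)  # explore: try a random step
            else:
                # Exploit: take the best-known step, breaking ties randomly.
                best = max(q[(state, a)] for a in ACTIONS)
                action = rng.choice([a for a in ACTIONS if q[(state, a)] == best])
            next_state = state + action
            if next_state == GOAL:
                reward, done = 1.0, True
            elif next_state == PIT:
                reward, done = -1.0, True
            else:
                reward, done = 0.0, False
            # Q-Learning update: move the estimate towards
            # reward + discounted value of the best next action.
            future = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            q[(state, action)] += ALPHA * (reward + GAMMA * future - q[(state, action)])
            if done:
                break
            state = next_state
    return q


def greedy_policy(q):
    """Best-known action for each non-terminal position."""
    return {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(1, N_STATES - 1)}


q_table = train()
policy = greedy_policy(q_table)
print(policy)  # the preferred step (+1 = right, -1 = left) for positions 1, 2 and 3
```

Nobody tells the algorithm that moving right is correct. It only sees the points it gains or loses, and the Q-table gradually comes to favour the steps that lead towards the reward.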

Q. What are the differences between supervised and unsupervised learning?

In Supervised machine learning, you teach the algorithm by showing examples like if the input is X then the output is y.

In Unsupervised machine learning the algorithm finds the patterns in data by itself. You DON’T have to supervise the learning process.

There is one target variable (Dependent Variable) in supervised machine learning which you predict using a set of predictors (Independent Variables).

There is no target variable in unsupervised machine learning.
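To see what "no target variable" looks like in practice, here is a minimal K-Means sketch in plain Python on a single column of data. The spending figures are hypothetical, and the simplified starting centers (spread evenly between the minimum and maximum) are a choice made for this sketch so that the result is reproducible; library implementations typically use smarter initialisation.

```python
# Hypothetical monthly spend of eight customers. Note: there is NO label column.
monthly_spend = [10.0, 12.0, 11.0, 13.0, 50.0, 52.0, 49.0, 51.0]


def k_means_1d(values, k=2, iterations=10):
    """Group numbers into k clusters; assumes k >= 2. Returns (centers, groups)."""
    low, high = min(values), max(values)
    # Deterministic start: spread the initial centers evenly between min and max.
    centers = [low + i * (high - low) / (k - 1) for i in range(k)]
    groups = [[] for _ in range(k)]
    for _ in range(iterations):
        # Step 1: assign every value to its nearest center.
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Step 2: move each center to the mean of the values assigned to it.
        centers = [
            sum(g) / len(g) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers, groups


centers, groups = k_means_1d(monthly_spend, k=2)
print(centers)
print(groups)
```

The algorithm separates the low spenders from the high spenders without ever being told which customer belongs to which group. In a supervised setting, by contrast, each row would come with a known answer y to learn from.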

Q. Which machine learning algorithm have you used?

There is no fixed answer to this question.

My recommendation is to mention those algorithms which you have used in your projects AND whose working (the math behind them) you understand. For example, you can say you have used Linear Regression, Random Forests, and XGBoost to predict the sales for the next quarter, and the XGBoost algorithm gave the best accuracy, so it was deployed in production.

After this, be ready to answer any questions related to the working of these algorithms. 🙂

Q. How do you know which machine learning algorithm should be used?

The key is to first understand what type of business problem you are solving.

Based on the type of problem, the algorithms are selected. I am listing some of the important algorithms and business problem types below.

Regression: Predicting a Number (Linear Regression, Decision Trees, Random Forests, XGBoost, AdaBoost)
Classification: Predicting a Category (Logistic Regression, Decision Trees, Random Forests, XGBoost, KNN, SVM)
Clustering: Grouping similar rows (K-Means, Hierarchical clustering, DBSCAN, OPTICS)
Dimension reduction: Reducing the number of variables in data (PCA, ICA, t-SNE, UMAP)
Association: Finding out which products sell together (Apriori, Eclat, FP-Growth)

Q. Can you explain how the algorithm you used in your project works?

This question is almost certain to come up. Once you talk about your project and explain how you have used linear regression or decision trees or KNN, the next question is “How does it work?”

I have seen people rushing into details when asked about the intuition, and vice versa. Listen carefully before you answer, and work out whether the interviewer is interested in the mathematical details of the algorithm or the intuition. If in doubt, ask whether they expect the details or an overview, then answer accordingly.

I recommend the flow below while answering this question:

Overall intuition of algorithm. Key highlights. Why did you select the Algorithm?
The goodness of fit measurement. Basically, how did you know the algorithm was able to learn the data correctly?
Hyperparameters. Which are the parameters you used and tuned to get the best results from the algorithm?
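For the goodness of fit point in the flow above, it helps to be able to compute a metric by hand if asked. The sketch below shows three commonly used measures in plain Python: R-squared and RMSE for regression, and accuracy for classification. These are example choices and not the only valid metrics; the actual and predicted values are made up for illustration.

```python
import math


def r_squared(actual, predicted):
    """Share of the variance in the actual values explained by the predictions."""
    mean_actual = sum(actual) / len(actual)
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_residual / ss_total


def rmse(actual, predicted):
    """Root mean squared error: the typical size of a prediction error."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def accuracy(actual, predicted):
    """Fraction of cases where the predicted category matches the actual one."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical regression results (e.g. quarterly sales).
actual_sales = [100.0, 120.0, 140.0, 160.0]
predicted_sales = [110.0, 120.0, 130.0, 160.0]
print(f"R-squared: {r_squared(actual_sales, predicted_sales):.2f}")
print(f"RMSE: {rmse(actual_sales, predicted_sales):.2f}")

# Hypothetical classification results (e.g. loan approvals).
actual_decisions = ["Yes", "No", "Yes", "Yes", "No"]
predicted_decisions = ["Yes", "No", "No", "Yes", "No"]
print(f"Accuracy: {accuracy(actual_decisions, predicted_decisions):.2f}")
```

These measures are most meaningful when computed on data the model did not see during training, so that they reflect how well it generalises and not how well it memorised the examples.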

In the next series of posts, I will explain the algorithms listed under supervised, unsupervised and reinforcement learning along with R/Python code snippets.

All the best for that Interview! 🙂

Author Details

Farukh Hashmi

Lead Data Scientist

Farukh is an innovator in solving industry problems using Artificial Intelligence. His expertise is backed by 10 years of industry experience. As a senior data scientist, he is responsible for designing AI/ML solutions that provide maximum gains for clients. As a thought leader, his focus is on solving the key business problems of the CPG Industry. He has worked across different domains like Telecom, Insurance, and Logistics. He has worked with global tech leaders including Infosys, IBM, and Persistent Systems. His passion for teaching inspired him to create this website!

https://thinkingneuron.com/

thinkingneuron@gmail.com