Introduction and Most Important Statistical concepts
Facing an interview for your next data science job can be easy!
Just make sure that the basics are in place and your passion for solving problems is highlighted during the interaction with the interview panel.
6 out of 10 questions in a data science interview will be pure concept based or theoretical. Hence it is easy to score well in this area.
Statistics is a vast subject and no one knows all of it! But there are some bare minimum concepts which you must know as a Data Scientist.
I have listed here some of the important questions asked very frequently in data science interviews related to statistics. I will write about Machine Learning and Data Science in the next series of blogs.
If you need answers for any specific question let me know in the comments section! I will be happy to help.
Table of contents
- Tell me about yourself
- Tell me about your recent data science project
- How did you measure your predictive model accuracy?
- What is MAPE?
- What is the Median APE?
- What is RMSE?
- Which one is better RMSE or MAPE?
- How to measure accuracy for Regression Models?
- What is R Squared value in Regression?
- How R Squared value is calculated?
- What is Adjusted R Squared Value?
- How to create a Confusion Matrix?
- What is Precision?
- What is Recall?
- What is F1-Score?
- What is Sensitivity?
- What is Specificity?
- What is ROC?
- What is AUC?
- What is Hypothesis testing?
- What is P-Value?
- What is Alpha Value?
- What is Confidence Interval?
- What is Type-1 error?
- What is Type-2 error?
- What is Standard Deviation?
- What is an Outlier?
- What is Correlation?
- What is Sampling?
- What are the types of sampling?
- Why do we use sampling in machine learning?
- What is central limit theorem?
- What is T-Test?
- What is Z-Test?
- What is F-Test?
- What is ANOVA test?
- What is Chi-Square test?
- What is AIC?
- What is BIC?
- What is Entropy?
- What is Information Gain?
- What is Gini Index?
- What is Multicollinearity?
Let’s get started!
Q: Tell me about yourself
The most cliche question ever! But still, it is asked in every interview 😉
This is the opportunity to showcase how you have redefined yourself as a data scientist by learning the required concepts and started working on it by generating a use case in your current project.
In the IT industry, you can’t control which project you get assigned to, so many of us keep doing whatever we were given as our first project. If you can show that you took control of your career path by finding out what you like to do and then learning it, it shows that you are serious about your career and goals. This is a great booster because it creates the image of a self-taught, self-motivated individual.
Write, rehearse, memorize this introduction and give a hint of your data science transition, with a line like “I learned data science from XYZ classes and then got an opportunity to work on predictive analytics. The first implementation I did was supervised machine learning regression problem”
Q: Tell me about your recent data science project
This is another fixed question, and since you know it will be asked, you must prepare it well. Almost every interviewer asks it because the next set of questions emerges from your answer, so have it ready in detail. This will control the flow of your interview.
A good way is to follow the flow below:
- State the business problem
- Explain your approach to solving it
- Narrate which machine learning algorithm you used
- Describe how the deployment was done
- Explain how it helped to improve the business
This flow will show your maturity as an experienced professional. Never state the technical details first. Focus on the bigger picture, and go into the specifics of the algorithms or deployment only if asked.
If you are asked to provide more detailed flow then follow this blog post to get a step by step approach in order to narrate the story.
Q: How did you measure your predictive model accuracy?
Accuracy is measured differently for Regression and Classification predictive models. Depending on what type of project you did, mention how accuracy was measured and how the model was deployed in production.
Regression: 100-MAPE or 100- (Median APE)
Classification: F1-Score, Precision, Recall, AUC, ROC
All these terms are explained one by one in the following section.
Q: What is MAPE?
Mean Absolute Percent Error.
This value tells you, on average, how much error the predictive model makes on each prediction.
APE is the absolute percentage difference between the predicted and original values. In the example above, the APE of the first row is 25% because the absolute difference between Original and Prediction is 5, and dividing it by the Original value 20 gives 25%.
In a way, you can say the prediction was 25% away from the original, hence the accuracy for this prediction was 100-25 = 75%. To understand the overall accuracy, we take the average of the errors across all predictions. This is known as the Mean Absolute Percent Error.
Accuracy of the model is 100-MAPE
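The calculation can be sketched in a few lines of Python. Only the first row (original 20, error of 5) follows the example in the text; the prediction value 25, the other rows and the helper name `ape` are assumptions made up for illustration.

```python
# Hypothetical data: only the first row (original 20, error 5) follows the text.
originals = [20, 50, 100, 80]
predictions = [25, 45, 110, 80]

def ape(original, predicted):
    """Absolute percent error of a single prediction."""
    return abs(original - predicted) / abs(original) * 100

errors = [ape(o, p) for o, p in zip(originals, predictions)]
mape = sum(errors) / len(errors)   # mean of the per-row APE values
accuracy = 100 - mape

print(errors)
print(mape)
print(accuracy)
```

Note that APE divides by the original value, so it is undefined for rows where the original is zero.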
Q. What is the Median APE?
Median Absolute Percent Error.
In the example, MAPE is higher than the Median APE because of the outlier ‘25.0’, which means there is one prediction with a 25% error.
The Median APE is used because the Mean APE is affected by outliers and can even go above 100%, which makes the accuracy value (100-MAPE) negative.
The median is helpful for analyzing the central tendency of the error made by the predictive model. For example, if the Median APE is 5% and 60 predictions were made, then about 30 of them have an error of 5% or less.
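A small sketch of why the median is more robust than the mean here. The APE values are made up for illustration; the last one is a deliberately extreme outlier.

```python
from statistics import mean, median

# Hypothetical APE values (in percent); the last one is an extreme outlier.
ape_values = [2.0, 3.0, 4.0, 5.0, 600.0]

mean_ape = mean(ape_values)      # pulled far up by the single outlier
median_ape = median(ape_values)  # middle value, unaffected by the outlier's size

print(100 - mean_ape)    # accuracy based on MAPE, negative in this case
print(100 - median_ape)  # accuracy based on Median APE
```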
Q. What is RMSE?
Root Mean Squared Error
- Find out the difference between original and predicted values for each row.
- Square the differences
- Sum all the squared differences
- Divide the sum by the number of rows to get the mean squared error
- Take the square root of that mean
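The steps above translate directly into code. This is a minimal sketch; the original and predicted values are hypothetical.

```python
import math

# Hypothetical original and predicted values, for illustration only.
originals = [10, 14, 18, 20, 11]
predictions = [12, 13, 15, 23, 15]

differences = [o - p for o, p in zip(originals, predictions)]  # step 1
squared = [d ** 2 for d in differences]                        # step 2
mean_squared_error = sum(squared) / len(squared)               # steps 3 and 4
rmse = math.sqrt(mean_squared_error)                           # step 5
print(rmse)
```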
Q. Which one is better RMSE or MAPE?
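In terms of interpretation, MAPE is better because it is easy to visualize: it represents the average percentage error made by the predictive model. RMSE is expressed in the units of the target variable and does not have such a clear interpretation.
In terms of penalizing large errors (outliers), RMSE is better: each error is squared before averaging, so a few large errors raise RMSE sharply. MAPE is also affected by outliers, but every error contributes only in proportion to its size.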
Q. How to measure accuracy for Regression Models?
Subtract the Mean Absolute Percent Error from 100 and the value is the accuracy. Either of the calculations below gives an accuracy value for the predictive model.
- 100-MAPE
- 100-(Median APE)
Q. What is R-Squared value in Regression?
R2 value measures the goodness of fit. It is NOT the Accuracy of the model. Accuracy is measured using 100-MAPE value.
It tells how much of the variation in the target variable is explained by the model, that is, the variance explained by the model vs the total variance.
- Max value of R2 is 1
- Min value of R2 is 0 for a linear regression with an intercept, measured on its training data. On test data it can be negative when the model predicts worse than simply using the mean.
As a rule of thumb, a good range of R2 is between 0.6 and 0.9. This means the predictive model is able to explain a good amount of variance in the data and can be taken into consideration for testing and accuracy calculation on Test Data.
- R2 < 0.5 suggests the model may be underfitting
- R2 > 0.9 is worth checking for overfitting
From a visual perspective: how close the points are to the line of best fit vs how far away from it they are.
Q. How R-Squared value is calculated?
Consider below an example of five predictions and original values
SSres means the Sum of Squared Residuals.
In the above example, SSres is 39.
SStotal means the Sum of Squared distance of each point from the mean value.
In the above example, SStotal is 75.2
The calculation for SSres and SStot can be seen below.
- SSres = (10-12)² + (14-13)² + (18-15)² + (20-23)² + (11-15)² = 39
- SStot = (10-14.6)² + (14-14.6)² + (18-14.6)² + (20-14.6)² + (11-14.6)² = 75.2
This equates to R2 = 1- (39/75.2) = 0.48
This means that, for the given data, the model was able to explain 48% of the total variance.
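The same calculation as a short Python sketch. The five original values (mean 14.6) and the predictions are read off the worked example; the fourth prediction is taken as 23, which is the value that yields SSres = 39.

```python
# Values read from the worked example; fourth prediction taken as 23.
originals = [10, 14, 18, 20, 11]
predictions = [12, 13, 15, 23, 15]

mean_original = sum(originals) / len(originals)
ss_res = sum((o - p) ** 2 for o, p in zip(originals, predictions))
ss_tot = sum((o - mean_original) ** 2 for o in originals)
r_squared = 1 - ss_res / ss_tot

print(ss_res, ss_tot, r_squared)
```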
Q. What is Adjusted R-Squared Value?
- p = Total number of explanatory variables in the model
- n = number of rows in training data.
The adjusted R2 value is never greater than R2, and it can also be negative.
Adjusted R2 takes into account the addition of new predictors to the model. It applies a penalty so that the explained variance does not increase just because a new predictor was added.
The Adjusted R2 value increases only if the new predictor improves the model more than would be expected by chance. R2, in contrast, never decreases when a new predictor is added to the model.
Hence Adjusted R2 is more reliable for judging the goodness of fit of regression models with several predictors.
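A minimal sketch using the standard textbook formula, Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1), with n and p as defined above. The function name and the input numbers are assumptions for illustration.

```python
def adjusted_r_squared(r_squared, n, p):
    """Standard adjusted R-squared; n = rows, p = explanatory variables."""
    return 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

# Hypothetical numbers: R2 = 0.48 from 50 rows and 3 predictors.
print(adjusted_r_squared(0.48, n=50, p=3))

# A weak fit with many predictors relative to rows can go negative.
print(adjusted_r_squared(0.05, n=20, p=5))
```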
Q. How to create a Confusion Matrix?
A Confusion matrix is created by comparing original values with predicted values in a classification model.
- True Positive(TP): How many times Yes was predicted as Yes
- True Negative(TN): How many times No was predicted as No
- False Positive(FP): How many times No was predicted as Yes
- False Negative(FN): How many times Yes was predicted as No
In below example, all of the above have been counted and the resultant matrix is known as a confusion matrix.
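The counting can be sketched as follows. The actual and predicted labels are made up for illustration.

```python
# Hypothetical actual and predicted labels, for illustration only.
actual    = ["Yes", "Yes", "Yes", "No", "No",  "No", "Yes", "No"]
predicted = ["Yes", "No",  "Yes", "No", "Yes", "No", "Yes", "No"]

pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == "Yes" and p == "Yes")  # Yes predicted as Yes
tn = sum(1 for a, p in pairs if a == "No" and p == "No")    # No predicted as No
fp = sum(1 for a, p in pairs if a == "No" and p == "Yes")   # No predicted as Yes
fn = sum(1 for a, p in pairs if a == "Yes" and p == "No")   # Yes predicted as No

# Rows are actual classes, columns are predicted classes.
print("            Pred Yes  Pred No")
print(f"Actual Yes  {tp:8d}  {fn:7d}")
print(f"Actual No   {fp:8d}  {tn:7d}")
```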
Q. What is Precision?
How many correct predictions were done for a class out of all predictions for that class?
Precision for ‘Yes’ class will tell out of all the ‘Yes’ predicted by the algorithm, how many were correct? Similarly, Precision for ‘No’ class will tell out of all the ‘No’ predicted by the algorithm, how many were correct?
i.e how precise the prediction is for that class.
A Good range for precision is 0.7-0.9
Q. What is Recall?
How many actual values were correctly recalled by the model? In other terms, how many predictions were correct out of all the original values for that class.
Recall for ‘Yes’ will tell out of all the Actual ‘Yes’ values how many were correctly predicted by the model.
Recall for ‘No’ will tell out of all the Actual ‘No’ values how many were correctly predicted by the model.
A good range for the recall is 0.7-0.9.
Q. What is F1-Score?
F1-Score is the harmonic mean of Precision and recall.
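It combines Precision and Recall into a single score, which makes it a common way to judge a classification model, especially when the classes are imbalanced. Note that it is not the same as plain accuracy, which is (TP+TN) divided by all predictions.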
A good range for F1-Score is 0.7-0.9
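Precision, Recall and F1-Score for the ‘Yes’ class can be computed from the confusion-matrix counts. The counts below are hypothetical.

```python
# Hypothetical confusion-matrix counts for the 'Yes' class.
tp, fp, fn, tn = 30, 10, 20, 40

precision = tp / (tp + fp)  # correct 'Yes' out of all predicted 'Yes'
recall = tp / (tp + fn)     # correct 'Yes' out of all actual 'Yes'
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(precision, recall, f1)
```

Because the harmonic mean is dominated by the smaller of the two values, F1 stays low unless both Precision and Recall are reasonably high.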
Q. What is Sensitivity?
Recall for the ‘Yes’ class is also known as Sensitivity or the True Positive Rate (TPR): TP / (TP + FN).
Q. What is Specificity?
Recall for the ‘No’ class is also known as Specificity or the True Negative Rate (TNR): TN / (TN + FP).
Q. What is ROC?
The curve between True Positive Rates(TPR) in Y-Axis and False Positive Rates(FPR) in X-Axis is known as the ROC curve. ROC stands for Receiver Operating Characteristic.
The plot is generated by varying the probability threshold used to classify a row as Yes and capturing the (FPR, TPR) pair at each threshold.
Q. What is AUC?
Area Under the Curve (AUC)
AUC is the amount of area covered under the ROC curve. A perfect classifier has an AUC of 1. As a rule of thumb, a good range for AUC is 0.6-0.9. It helps to understand the performance of the model: the higher the AUC, the better.
An AUC of 0.5 means the predictive model is not able to discriminate between the classes any better than random guessing. A value below 0.5 means its ranking is worse than random, which usually points to inverted labels or predictions.
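A rough sketch of how ROC points and AUC can be computed by hand: sweep a threshold over the predicted probabilities, record (FPR, TPR) at each one, then take the area with the trapezoidal rule. The labels and scores are made up for illustration; in practice a library routine would normally be used.

```python
# Hypothetical true labels (1 = Yes) and predicted probabilities of Yes.
labels = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]

positives = sum(labels)
negatives = len(labels) - positives

# One (FPR, TPR) point per threshold, from strictest to most lenient.
points = [(0.0, 0.0)]
for threshold in sorted(set(scores), reverse=True):
    tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
    points.append((fp / negatives, tp / positives))

# Area under the curve by the trapezoidal rule.
auc = sum((x2 - x1) * (y1 + y2) / 2
          for (x1, y1), (x2, y2) in zip(points, points[1:]))
print(auc)
```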
Q. What is Hypothesis testing?
Hypothesis means assumption.
Hypothesis testing means checking, based on the given data, whether our assumption holds.
Consider a scenario from a tire factory. The radius of the ideal tire must be 16 inches. However, even if there is a deviation of 8% then it is accepted. Hence in this scenario, we can apply hypothesis testing like below using some dummy values for the explanation.
- Define the Null Hypothesis (H0): The radius of the tire= 16 Inch
- Define the alternate Hypothesis(Ha): The radius of the tire != 16 Inch
- Define the error tolerance limit: 8%
- Conduct the test: Chosen T-Test
- Look at the P-value generated by the test: P-value= 0.79
- If P-Value > 0.05, do not reject the Null Hypothesis; otherwise reject it: 0.79 > 0.05, so we fail to reject H0. There is no evidence that the radius differs from 16 inches, hence the tires produced are treated as good quality
Q. What is P-Value?
P-Value is the probability of seeing data at least as extreme as what we observed, assuming H0 is true. It is not the probability that H0 itself is true.
The higher the P-value, the more consistent the data is with our assumption (H0). The textbook threshold to reject a Null Hypothesis is 5%. So, if the P-Value is less than 0.05, data this extreme would occur less than 5% of the time if the Null Hypothesis were true, hence it is rejected. Otherwise, if the P-Value is more than 0.05, we fail to reject the Null Hypothesis (loosely speaking, it is "accepted").
Q. What is Alpha-Value?
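The acceptable error threshold, also known as the Level of significance. It is the probability of rejecting a true Null Hypothesis (a Type-1 error) that we are willing to tolerate.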
In the above tire example, the acceptable error amount was 8%
Q. What is Confidence Interval?
The range of values that is likely to contain the population mean, with its width determined by the error threshold (Alpha Value).
In the above example the population mean is assumed to be 16, since we have assumed that all good tires are produced with a 16-inch radius.
If we take a sample of 50 tires, then we will have values like 16.2, 16.3, 15.98, 15.96, 15.99, 16.23, and so on.
For the sake of understanding let’s say the mean radius of those 50 tires came out to be 16.15. This is called the Sample mean.
Now, based on this sample, we can calculate a range: the min and max values between which the mean of the population is expected to lie. The mean of the population is the mean of the radius of all the tires.
So basically, we are trying to estimate what the mean of all the tires looks like based on the given sample. And instead of giving a single value answer, we are providing a range of values. This range is known as the Confidence Interval.
The confidence interval is affected by the alpha value. For every alpha value, we find the value of the statistic which gets multiplied with the standard error.
Confidence Interval = [ Mean(Sample) - N*(SE), Mean(Sample) + N*(SE) ]
- SE=Standard Error=Standard Deviation of sample/sqrt(number of samples)
- N = Value of the statistic for the given alpha value (probability of error margin). Use the Z-statistic when the sample is large or the population standard deviation is known; use the t-statistic when the sample is small and the standard deviation is estimated from the sample.
For example, let us choose an alpha value of 5%. Hence, we are 95% confident that the mean value of the population will fall within the confidence interval we find. Assuming a normal distribution, the value of N is 1.96 for alpha=5%. Similarly, the value of N is 2.58 for alpha=1%, and so on. These “N” values come from the Z-values of the standard normal distribution, the ideal bell curve.
Hence, to calculate a confidence interval for the population mean, we need a sample of values: we calculate its mean, we calculate its standard deviation, and we find the N-value based on the alpha level.
For the sake of explanation, assume below values were found for a sample of 50 tyres.
- Sample Mean of radius=16.15
- Standard deviation of 50 radius values=0.64
- N=1.96 for alpha=5%
For the above values, the confidence interval will be calculated as [ 16.15 - 1.96*(0.64/sqrt(50)) , 16.15 + 1.96*(0.64/sqrt(50)) ],
which comes out as [15.97 , 16.33].
Hence, based on the given sample of 50 tires, we are 95% confident that the mean value of the radius of all the tires (population) will be somewhere between 15.97 and 16.33.
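The interval can be reproduced with a few lines of Python, using the sample values assumed in the text. The bounds evaluate to roughly 15.973 and 16.327 before rounding.

```python
import math

sample_mean = 16.15  # mean radius of the 50 sampled tires
sample_sd = 0.64     # standard deviation of the 50 radius values
n = 50
z = 1.96             # Z-value for alpha = 5%

standard_error = sample_sd / math.sqrt(n)
lower = sample_mean - z * standard_error
upper = sample_mean + z * standard_error

print(round(lower, 2), round(upper, 2))
```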
Q. What is Type-1 error?
A type-1 error, also known as an error of the first kind, occurs when the null hypothesis (H0) is really true but is rejected.
In terms of the confusion matrix, the False Positives(FP) are Type-1 errors.
Q. What is Type-2 error?
A type-2 error, also known as an error of the second kind, occurs when the null hypothesis is false but we fail to reject it.
In terms of the confusion matrix, the False Negatives(FN) are Type-2 errors.
Q. What is Standard Deviation?
Standard Deviation tells us the overall spread of the values by measuring the typical distance of each point from the mean value. In other terms, roughly how far each point is from the mean on average.
To calculate it, find the variance (the average of the squared distances from the mean) and take its square root. If this value is large, the data is very scattered; if it is small, the values are close to each other. More details about standard deviation can be found in this blog post.
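A short sketch of the calculation with made-up values. It divides by the number of values, which gives the population variance; for a sample, the usual convention is to divide by n - 1 instead.

```python
import math

# Hypothetical values, for illustration only.
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean_value = sum(values) / len(values)
# Population variance: average squared distance from the mean.
variance = sum((v - mean_value) ** 2 for v in values) / len(values)
standard_deviation = math.sqrt(variance)

print(mean_value, variance, standard_deviation)
```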
Q. What is an Outlier?
Values which are extremely low or extremely high compared to all other values in a dataset are called outliers.
E.g. (1, 2, 3, 4, 5, 6, 50): here 50 is an outlier because it is abnormally larger than most of the values in the dataset.
Q. What is Correlation?
Correlations are mathematical relationships between variables. You can identify correlations on a scatter diagram by the distinct patterns they form. The correlation is said to be linear if the scatter diagram shows the points lying in an approximately straight line.
Let’s take a look at a few common types of correlation between two variables.
Positive linear correlation (r = 0 to 1)
Positive linear correlation is when low values on the x-axis correspond to low values on the y-axis, and higher values of x correspond to higher values of y. In other words, y tends to increase as x increases.
Negative linear correlation (r = -1 to 0)
Negative linear correlation is when low values on the x-axis correspond to high values on the y-axis, and higher values of x correspond to lower values of y. In other words, y tends to decrease as x increases.
If the values of x and y form a random pattern, then we say there’s no correlation.
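The correlation value r referred to above is usually the Pearson correlation coefficient. Below is a minimal sketch of it; the helper name `pearson_r` and the data are assumptions for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
print(pearson_r(x, [2, 4, 5, 4, 5]))   # y tends to rise with x, so r is positive
print(pearson_r(x, [10, 8, 6, 4, 2]))  # y falls exactly linearly with x
```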
Q. What is Sampling?
Sampling means choosing random values.
Consider a bubble gum jar below with various colors of bubble gums.
If you ‘randomly’ select a few gums from the jar, it is very likely that the selected ones will have gums of all colors.
Hence, you can say that the randomly selected sample is a representative of all the gums present in the jar.
In statistical language, these randomly selected gums are known as the sample, and the jar is known as the population.
More details about Sampling Theory can be found in this blog post.
Q. What are the types of sampling?
There are four major types of sampling techniques listed below. More details can be found on this blog post
- Simple Random Sampling
- Stratified Sampling
- Systematic Sampling
- Biased sampling
Q. Why do we use sampling in machine learning?
Sampling Theory helps you to examine how good the predictive model will perform BEFORE it is deployed in production. Typically we keep 70% to train the model and 30% to test the model. However, this ratio can be changed to 80:20 or 75:25 and the results are observed. More details about it can be found in this blog post.
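A 70:30 split by simple random sampling can be sketched with the standard library alone. The dataset of 100 row identifiers and the seed value are assumptions for illustration.

```python
import random

# Hypothetical dataset: 100 row identifiers.
rows = list(range(100))

random.seed(42)       # fixed seed so the split is repeatable
random.shuffle(rows)  # random order, i.e. simple random sampling

split_point = int(len(rows) * 0.7)
train_rows = rows[:split_point]  # 70% used to train the model
test_rows = rows[split_point:]   # 30% held back to test it

print(len(train_rows), len(test_rows))
```

Changing 0.7 to 0.8 or 0.75 gives the other ratios mentioned above.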
Q. What is central limit theorem?
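If you repeatedly take large samples (more than 30 values each) of size n from a population, then the mean values of all those samples will approximately follow a normal distribution, whatever the shape of the population's own distribution (provided its variance is finite). I.e. if you plot a histogram of the sample means, it will form a bell curve.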
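A small simulation can make this concrete. The sketch below assumes a skewed (exponential) population, which is clearly not bell shaped, and checks that the sample means still centre on the population mean with a spread of about sd/sqrt(n). The population size, sample size and seed are arbitrary choices.

```python
import random
from statistics import mean, stdev

random.seed(0)

# Hypothetical skewed population: exponential values with mean about 1.
population = [random.expovariate(1.0) for _ in range(20_000)]

# Draw 2,000 samples of size 50 and record the mean of each sample.
sample_means = [mean(random.sample(population, 50)) for _ in range(2000)]

# The sample means centre on the population mean ...
print(mean(population), mean(sample_means))
# ... and their spread is close to population sd / sqrt(sample size).
print(stdev(population) / 50 ** 0.5, stdev(sample_means))
```

Plotting a histogram of `sample_means` would show the bell curve described above.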
Q. What is T-Test?
The T-Test is one of the many Tests employed in Hypothesis testing.
It is Used to see if the mean of the population is statistically different from an assumed value(Null Hypothesis).
Consider below example where we are selecting some random number of gumballs from a jar.
Assumption: The average size of all gumballs inside the jar is 25mm (µ0)
If you randomly select 20 gumballs from the jar, then the average size of those gumballs should be close to 25mm. However, it can vary a little due to manufacturing defects, so let us say the average came out to be 24.3mm (X̄).
Assume the standard deviation of the sizes of the sampled gumballs is 0.1mm (s).
T-Value = (X̄ - µ0) / (s / √n)
Here in our case: T-Value = (24.3 - 25) / (0.1 / √20) = -31.30
The higher the absolute T-Value, the more likely it is that the difference between the assumed population mean and the sample mean is statistically significant.
The lower the absolute T-Value, the less likely it is that the difference is statistically significant, i.e. the sample gives no evidence that the population mean differs from the assumed value.
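The T-Value from the gumball example can be reproduced as below, using the numbers given in the text. Turning the T-Value into a P-value additionally needs the t-distribution with n - 1 degrees of freedom, which is not shown here.

```python
import math

sample_mean = 24.3      # X-bar: average size of the 20 sampled gumballs (mm)
hypothesized_mean = 25  # mu0: assumed average size of all gumballs (mm)
s = 0.1                 # standard deviation used in the example (mm)
n = 20                  # sample size

t_value = (sample_mean - hypothesized_mean) / (s / math.sqrt(n))
print(round(t_value, 2))
```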
The t-test is also used in Linear Regression to test which variable is helping to predict the target variable and which is not.
H0: The variable is not helping
The t-test is conducted for each of the variables and it produces a T-Value and a probability. If the probability (p-value) is less than 0.05 then we reject the hypothesis(H0). That means the variable is helping and our assumption was wrong. So we select the variable in the model.
Q. What is Z-Test?
The Z-Test serves the same purpose as the T-Test. As a rule of thumb, we use the Z-Test when the sample size is MORE than 30 and the population standard deviation σ is known (or well approximated); otherwise the T-Test is used.
In the Z-Test, if the population standard deviation σ is not known, it is approximated by the sample standard deviation sd, and the standard error is computed as sd/√(n).
Z-score can be calculated from the following formula.
z = (X – μ) / σ
Where z is the z-score, X is the value of the element, μ is the population mean, and σ is the standard deviation.
Q. What is F-Test?
F-Test is used to check if the VARIANCES are equal for two populations.
It is also used in Linear Regression, where the Null Hypothesis (H0) is that none of the predictors helps, i.e. all the coefficients are zero and no useful model can be created. If the P-Value < 0.05, H0 is rejected and the model as a whole is accepted as significant.
Q. What is ANOVA test?
ANOVA stands for Analysis of Variance. It is used when the means of more than 2 groups (populations) are to be compared. The t-test cannot be used here since it can compare the means of at most 2 groups. Hence, when we have more than 2 groups, we use ANOVA, which is performed using the F-Test.
Q. What is Chi-Square test?
Chi-Square test is used to check if there is any relationship between two categorical variables. We cannot compute the correlation value between two categorical variables hence Chi-square test is used.
χ2 = Σ (Observed - Expected)² / Expected, summed over all cells of the contingency table
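A sketch of the statistic for a 2x2 table of observed counts. The counts are made up; the expected count of each cell is what we would see if the two categorical variables were independent (row total * column total / grand total).

```python
# Hypothetical 2x2 table of observed counts:
# rows = two groups, columns = two categories of the other variable.
observed = [[30, 20],
            [20, 30]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # Expected count if the two variables were independent.
        expected = row_totals[i] * col_totals[j] / grand_total
        chi_square += (obs - expected) ** 2 / expected

print(chi_square)
```

The statistic is then compared against the Chi-Square distribution with (rows - 1) * (columns - 1) degrees of freedom to obtain a P-value.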
Q. What is AIC?
The Akaike Information Criterion (AIC) provides a method for assessing the quality of your predictive model through comparison of related models. The number itself is not meaningful. If you have more than one model, then you should select the model that has the smallest AIC.
AIC is used in Logistic Regression to perform goodness of fit test since there is no R2 for Logistic Regression.
AIC is computed using below equation
AIC = -2 * log-likelihood + K * nPar
- log-likelihood: The log-likelihood of logistic regression
- K : 2
- nPar: Number of estimated parameters in the model (for a regression, the predictor columns plus the intercept)
Q. What is BIC?
The formula for the Bayesian information criterion (BIC) is similar to the formula for AIC, but with a different penalty for the number of parameters. Unlike the AIC, the BIC penalizes free parameters more strongly.
BIC is computed using the equation below, which has the same form as AIC. The only difference is the value of K.
BIC = -2 * log-likelihood + K * nPar
- log-likelihood: The log-likelihood of logistic regression
- K : log(number of Rows)
- nPar: Number of estimated parameters in the model (for a regression, the predictor columns plus the intercept)
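Both criteria can be sketched from a model's log-likelihood. The function names, the two candidate models and their log-likelihood values are assumptions for illustration, and log is taken as the natural logarithm.

```python
import math

def aic(log_likelihood, n_par):
    """AIC: penalty K = 2 per parameter."""
    return -2 * log_likelihood + 2 * n_par

def bic(log_likelihood, n_par, n_rows):
    """BIC: penalty K = log(number of rows) per parameter."""
    return -2 * log_likelihood + math.log(n_rows) * n_par

# Hypothetical fitted models: (log-likelihood, number of parameters).
models = {"small": (-120.0, 3), "large": (-118.5, 8)}
n_rows = 200

for name, (ll, k) in models.items():
    print(name, aic(ll, k), bic(ll, k, n_rows))
```

Here the larger model fits slightly better but pays a bigger penalty, so the smaller model has the lower AIC and BIC and would be preferred.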
Q. What is Entropy?
Entropy means randomness. It is used to measure the randomness/impurity in a group.
In simple terms, if all the entities in a group are of the same type, then the group is pure and has no randomness, hence the entropy is Zero.
Entropy = - Σ pi * log2(pi), where pi is the probability of class i.
The entropy of a group in which all examples belong to the same class is Zero. It means minimum randomness.
entropy = -1 * log2(1) = 0
The entropy of a two-class group in which 50% of the examples belong to each class is 1. It means maximum randomness.
entropy = -0.5 * log2(0.5) - 0.5 * log2(0.5) = 1
The Machine Learning algorithms ID3 (Iterative Dichotomiser 3), C4.5 and C5.0 all use Entropy to find the best root node and to split further nodes.
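The two entropy examples can be checked with a short function. This is a sketch of the usual definition, the negative sum of p * log2(p) over the classes; the function name is an assumption.

```python
import math

def entropy(probabilities):
    """Entropy in bits: -sum(p * log2(p)), skipping classes with p = 0."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

print(entropy([1.0]))       # one class only: minimum randomness
print(entropy([0.5, 0.5]))  # two classes, 50/50: maximum randomness
print(entropy([0.9, 0.1]))  # mostly one class: somewhere in between
```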
Q. What is Information Gain?
How much information is gained if a node is split in a decision tree? In other terms how much discriminative power is gained if the node is split.
Formally it is defined as below: the entropy of the Parent node minus the weighted average of the Entropy of all child nodes, where each child is weighted by its share of the parent's rows.
Information Gain = Entropy(Parent Node) - WeightedAverage(Entropy(All Child Nodes))
In the best case scenario for two classes, the parent node will have the highest Entropy=1 and all the child nodes will have Entropy=0. The information gain in this scenario will be the highest. The value will be equal to 1.
Machine Learning Algorithm CART (Classification and Regression Trees) uses Information Gain or Gini Index to find the best root node and split further nodes.
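A minimal sketch of Information Gain for one candidate split, weighting each child node by its share of the parent's rows. The labels and both candidate splits are made up for illustration.

```python
import math

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = {label: labels.count(label) for label in set(labels)}
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    weighted = sum(len(child) / len(parent) * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical parent node with a 50/50 class mix.
parent = ["Yes"] * 5 + ["No"] * 5
perfect_split = [["Yes"] * 5, ["No"] * 5]
weak_split = [["Yes"] * 3 + ["No"] * 2, ["Yes"] * 2 + ["No"] * 3]

print(information_gain(parent, perfect_split))  # children are pure
print(information_gain(parent, weak_split))     # children are still mixed
```

A tree-building algorithm would evaluate every candidate split this way and pick the one with the highest gain.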
Q. What is Gini-Index?
Gini Index is similar to Entropy. If all values are the same, then the value of the Gini Index will be zero. Otherwise, it will be some positive value calculated as Gini = 1 - Σ pi², where pi is the proportion of class i.
For example, if there are two classes (Binary Classification) and both of them are present 50/50 then the value calculated will be as below:
Gini = 1 - (1/2)^2 - (1/2)^2 = 0.5
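The same calculation as a small function, assuming the usual definition of the Gini Index as 1 minus the sum of squared class proportions. The function name is an assumption.

```python
def gini_index(probabilities):
    """Gini index: 1 minus the sum of squared class proportions."""
    return 1 - sum(p ** 2 for p in probabilities)

print(gini_index([1.0]))       # all values belong to one class
print(gini_index([0.5, 0.5]))  # binary 50/50 case from the text
```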
Q. What is Multicollinearity?
Collinearity means a linear relationship between two variables.
Two variables are perfectly collinear if there is an exact linear relationship between them. For example, V2 = a* V1 +b. If there is such a relation, then V1 and V2 are collinear.
Multicollinearity refers to a situation in which two or more explanatory variables in a Multiple Regression model are highly linearly related.
More commonly, the issue of multicollinearity arises when there is an approximately linear relationship between two or more independent variables(Predictors).
In simple terms, two Predictor variables that have a high correlation with each other will generate Multicollinearity.
It is bad for the regression because the correlated predictors explain the same variance twice. The model cannot tell their individual effects apart, so the coefficient estimates become unstable and their variances are inflated, which makes the significance tests of those variables unreliable.
Q. How to remove Multicollinearity in Data?
- Check the VIF of all the Predictor variables using the vif() function from library(car) in R, or the variance_inflation_factor() function from the statsmodels library in Python
- If any variable has VIF>5, remove the one with the highest VIF from the regression equation
- Re-check the VIF
- Repeat the steps above till all variables have VIF<5
Q. What is VIF?
Variance Inflation Factor. It is used to detect multicollinearity in data.
VIFj = 1 / (1 - R²j), where R²j is the R-Squared of the regression of Predictor j on all the other Predictors. The denominator (1 - R²j) is called the tolerance.
- A tolerance of less than 0.20 or 0.10 and/or a VIF of 5 or 10 and above indicates a multicollinearity problem.
- If a high VIF is found for multiple Predictors, the Predictor with the highest VIF is removed and the test is conducted again.
- library(car) in R has a function called vif() to calculate VIF for each of the predictors.
- The variance_inflation_factor() function in the statsmodels library does the same in Python.
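To see what the number means, here is a hand-rolled sketch for the simplest case of two predictors, using the standard definitions VIF = 1 / (1 - R²j) and tolerance = 1 - R²j. With only two predictors, the R-Squared of regressing one on the other equals their squared Pearson correlation, so no regression library is needed. The data and the helper name `pearson_r` are assumptions; with more predictors, use the library functions listed above.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical predictors: v2 is almost an exact linear function of v1.
v1 = [1, 2, 3, 4, 5, 6]
v2 = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

r_squared_j = pearson_r(v1, v2) ** 2  # R-squared of v2 regressed on v1
tolerance = 1 - r_squared_j
vif = 1 / tolerance

print(r_squared_j, tolerance, vif)
```

Because the two predictors are nearly collinear, the tolerance is close to zero and the VIF is far above the threshold of 5.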
Conclusion and Further Reading:
Statistics is the backbone of Data Science. All of machine learning is possible because statistics is combined with programming.
In order to apply machine learning as data science, you must understand the statistics behind it, because it helps you choose the right technique in the right place.
I highly recommend going through the resources below to dive deep into statistics.
- Head First Statistics: This book makes statistics fun! Written in an easy to understand way for anyone who hates mathematics.
- Khan Academy: This website takes you from very basic to advanced level concepts in a step by step way
If you feel there is any concept for which you need an explanation, please submit your question in the comments. I will add it to this list.