Understanding machine learning algorithms and statistics is half of the story.
Choosing the right algorithm for the business problem at hand is the real challenge.
A common question is, “Where can I use machine learning in my project?”
My answer: “Understand the motive for the project, then support that motive with data science.”
Corporations are spending millions on setting up IT processes and BI in order to gain a competitive advantage by trying to understand their customers and their competitors.
If there is anything they want to forecast better ahead of time, it is a potential use case for Data Science.
For example, an insurance company may want to forecast “Is the submitted claim fraudulent or not?” A telecom company may want to understand “Is this consumer going to stay with us or not?” A consumer product goods company may want to know “How many units of soap will we be able to sell during the next month?”
In all these cases, machine learning algorithms can help to predict future numbers by learning the patterns from historical data.
Having solved business problems in multiple domains using machine learning, I have seen a few steps that are common to every scenario, which I am going to list and discuss here.
The idea is to use the below steps as a template to identify opportunities for data science in any given business domain and apply the correct machine learning algorithm for it.
In this post, I will be focusing on supervised machine learning only. In a follow-up post, I will discuss the application of unsupervised techniques.
14 Steps to apply data science to any business problem
- Define the business problem
- Identify the Target Variable in data
- Choosing the appropriate type of machine learning
- Remove useless variables from the data
- Identify ‘potential’ Predictor Variables in data
- Treatment of missing data in each one of the predictor variables
- Treatment of outliers in data
- Splitting the data. 70% for Training 30% for Testing by random sampling
- Creating the model on Training data
- Measuring the accuracy on Testing data
- Repeat steps 8 to 10 at least 5 times (known as Bootstrapping)
- Finding the importance of each predictor statistically
- Train the predictive model on full data
- Deploy the predictive model in production
Let’s get started!
Step-1: Define the business problem
Whenever you solve a data science problem, always begin by defining the business problem.
This is arguably the most important step of your data science project, since it determines the flow of the solution and the choice of algorithms.
Ask yourself this question every time: “I need to create a model which will predict something. What is it?” This will help you find the reason why the model is required.
What exactly do you want to predict with the help of machine learning model? Do you want to predict sales for next quarter/year? Do you want to predict if the insurance claim submitted is fraud or not? etc.
Take the help of your clients or the Business Analysts to understand their needs.
Understand what kind of predictions will help your clients perform their business better.
Define the business problem statement like the examples below.
“The machine learning model will predict the volume of sales for every upcoming quarter.”
“The machine learning model will predict whether the submitted claim is fraudulent.”
“The machine learning model will predict whether a loan application is good or bad.”
“The machine learning model will predict the ideal valuation of a home.”
Step-2: Identify the target variable in data
Target Variable: the variable/column/feature in the data which you want to predict.
Once you have defined the business problem clearly, the next step is to find the target variable in the data. Sometimes it is very easy to spot the target variable by name; for example, column names like “sales”, “fraud”, “price”, and “turnover” are obvious.
But sometimes the databases may not have a friendly naming convention, so you may see the same columns under more technical names. For example, sales may be present under a name like “SAL_DB_CAPT”, which is not intuitive to spot at one look. In that case, a business data dictionary or a Business Analyst will help you spot the sales column in the data, which is the target variable.
The thumb rule is: whenever in doubt, quickly take the help of your clients or a business analyst. Don’t assume anything unnecessarily!
Step-3: Choosing the appropriate type of machine learning
At this point, you know the business problem you are solving and the type of the Target Variable.
Either you are predicting a continuous number like “Sales”, “Profit”, “Demand”, “Turnover”, “Volume”, or “Number of Visitors”. This is Regression.
Or you are predicting a categorical/discrete value like “0/1”, “Yes/No”, “Good/Bad”, or “Silver/Gold/Platinum”. This is Classification.
Regression and classification are the two big types of supervised machine learning used in the industry currently. Almost every business domain has one or more applications of these two techniques.
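As a rough sketch of this decision, the helper below guesses the type of machine learning from the Target Variable's data type. The 20-distinct-values threshold is my own illustrative assumption, not a fixed rule:

```python
import pandas as pd

def choose_ml_type(target: pd.Series) -> str:
    # Heuristic for illustration: a numeric target with many distinct
    # values suggests Regression; anything else suggests Classification.
    if pd.api.types.is_numeric_dtype(target) and target.nunique() > 20:
        return "Regression"
    return "Classification"

sales = pd.Series([100.0 + 1.7 * i for i in range(50)])  # continuous target
churn = pd.Series(["Yes", "No", "No", "Yes"])            # categorical target

print(choose_ml_type(sales))   # Regression
print(choose_ml_type(churn))   # Classification
```

In practice a Business Analyst's confirmation beats any automated heuristic, but a check like this is a useful sanity test on new data.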
Step-4: Remove Useless Variables from data
There are a few types of columns/variables which are simply useless for the machine learning algorithm because they do not hold any patterns with respect to the Target Variable.
For example, an ID column. It is useless because every row has a unique ID, and hence it does not have any patterns in it. It is wrong to say that if the ID is 1002 then the sales will be 20,000 units. However, if you pass this variable as a predictor to the machine learning algorithm, it will try to use the ID values in order to predict sales and come up with some garbage logic!
The rule of Garbage In -> Garbage Out applies!
Does that mean every column which has unique values for each row is useless?
Here, business knowledge is very critical. Almost every continuous numeric column has a unique value in almost every row, like Age, Sales, Demand, etc., yet some of these are important from the business perspective. These columns contain patterns, and that is exactly what the machine learning algorithm needs to find out; for example, when the Demand number is 500 units, the sales are seen at 450 units.
Business domain knowledge helps in distinguishing useful and useless columns
A few more types of columns are also useless, like row IDs, phone numbers, complete addresses, comments, descriptions, raw dates, etc. All these are not useful because each row contains a unique value, and hence there will not be any patterns.
Some of these can be used indirectly by deriving certain parts from them. For example, dates cannot be used directly, but the derived month, quarter, or week can be used, because these may hold some patterns with respect to the target variable. This process of creating new columns from existing columns is known as Feature Engineering.
The thumb rule is: remove those columns which do not exhibit any patterns with respect to the target variable. Do this only when you are absolutely sure; otherwise, give the column a chance.
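As a minimal pandas sketch of this step, the column names below are made up for illustration; the pattern is: derive useful parts first, then drop the identifier-style columns.

```python
import pandas as pd

# Hypothetical retail data; all column names are illustrative.
df = pd.DataFrame({
    "CustomerID": [1001, 1002, 1003, 1004],     # unique per row -> no pattern
    "Phone": ["555-01", "555-02", "555-03", "555-04"],
    "OrderDate": pd.to_datetime(["2023-01-15", "2023-02-10",
                                 "2023-02-20", "2023-03-05"]),
    "Age": [25, 34, 28, 41],                    # continuous but meaningful
    "Sales": [200, 450, 310, 520],              # the Target Variable
})

# Feature Engineering: a raw date is useless, but its month may hold patterns.
df["Month"] = df["OrderDate"].dt.month

# Remove the columns that cannot exhibit patterns w.r.t. the target.
df = df.drop(columns=["CustomerID", "Phone", "OrderDate"])
print(list(df.columns))  # ['Age', 'Sales', 'Month']
```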
Step-5: Identify ‘potential’ Predictor Variables in data
This step is the most tricky!
Every business problem is driven by certain factors. For example, the sales of a shampoo brand are driven by a lot of factors, like those listed below:
- How good is the product on a scale of 1 to 10?
- How well is it promoted in the market?
- How many retail stores is it available in?
- Is it available in online stores?
- Has any movie star endorsed it on TV?
- What were sales last year?
- How many people purchase it out of loyalty?
The list can go on and on.
The important thing is to identify which of these factors really affect sales (the Target Variable).
All these factors which influence the target variable are also known as predictors, because they might help to predict the value of the Target Variable in the future.
The next thing you may wonder is: where do I begin? How will I know which factors affect the target variable?
This can be solved easily by using the methodical approach listed below:
- Talk to the business analysts/clients to understand how they guess/judge the value of the target variable when it is required. For example, how do they estimate sales? What factors do they consider while planning the supply of goods? These discussions will provide a set of variables which are important from the business perspective.
- Explore each variable/column based on how its values are distributed. In general, a good variable will roughly follow a normal distribution: the count of low values will be low, the count of medium values will be the highest, and the count of high values will be low again.
- To find the above distribution for every variable, you need to either use a histogram(for continuous numeric variables) or a bar chart (for discrete categorical variables)
- The idea is to see the distribution of values in a column. If it is too far away from the ideal bell curve, that column may not be useful.
- Whenever in doubt about a column, give it a chance, and then measure its importance later using the variable importance charts discussed in Step-12.
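Besides eyeballing a histogram or bar chart, skewness gives a quick numeric read on how far a column is from the ideal bell curve. A rough sketch on synthetic columns (the data here is made up for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic columns: one roughly bell-shaped, one heavily lopsided.
bell_shaped = pd.Series(rng.normal(loc=50, scale=10, size=1000))
lopsided = pd.Series(rng.exponential(scale=10, size=1000))

# Skewness near 0 suggests a bell curve; a large value flags a column
# whose distribution is far from the ideal shape.
print(round(bell_shaped.skew(), 2))
print(round(lopsided.skew(), 2))
```

A strongly skewed column is not automatically useless; it is simply a candidate for closer inspection or a transformation.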
Step-6: Treatment of missing data in each one of the predictor variables
Before passing the data to machine learning algorithms, it is mandatory to remove/replace every missing value.
This is required because missing values in data will bias the results produced by the machine learning algorithm.
Missing value treatment is performed separately for each column.
This is also known as Missing data imputation.
How do I replace missing values in data?
There is a neat flow of steps you can follow while treating missing values in any data. These steps are listed below.
- If there are more than 30% missing values in a column, remove that column from the data; don’t use it for machine learning. If you try to replace 30% of the values with the median, it will introduce too much bias into the data and affect the results produced by the predictive model.
- If n(missing) << n(total rows) then delete rows with missing values. For example, if there are 50,000 rows in the data, and only 10 rows have missing values, then we can safely delete those 10 rows from the data to save time.
- If the datatype is Quantitative(Continuous Numeric):
- Impute missing values with the median value of that column
- Another option can be to impute missing values by generating a logical value based on other columns/business rules
- If the datatype is Qualitative (Categorical)
- Impute the missing values with mode value(most frequently occurring value) of that column
- Another option is to Impute missing values by generating a logical value based on other columns/business rules
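The median and mode imputation steps above can be sketched with pandas; the column names and values are hypothetical:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [25.0, np.nan, 28.0, 41.0, np.nan],             # continuous
    "Segment": ["Gold", "Silver", None, "Gold", "Gold"],   # categorical
})

# Continuous column -> impute with the median of the column.
df["Age"] = df["Age"].fillna(df["Age"].median())
# Categorical column -> impute with the mode (most frequent value).
df["Segment"] = df["Segment"].fillna(df["Segment"].mode()[0])

print(df["Age"].tolist())      # [25.0, 28.0, 28.0, 41.0, 28.0]
print(df["Segment"].tolist())  # ['Gold', 'Silver', 'Gold', 'Gold', 'Gold']
```

Note the imputation is done per column, as described above; business-rule-based imputation would replace the simple `fillna` calls with custom logic.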
Step-7: Treatment of outliers in data
What are outliers?
Outliers are those values which do not follow the trend in the data. A simple example can be seen below.
100 is an outlier in the list of values [2, 3, 4, 3, 4, 5, 3, 4, 100]. Since most of the values are single-digit, 100 is an outlier relative to these values.
In short, a value which is abnormally high or low as compared to the group is an Outlier.
Outliers affect predictive modeling very badly.
If you pass data with outliers in it to a machine learning algorithm, the algorithm has to include that outlier data in its logic. This will introduce errors in the model.
How to treat outliers?
- Remove outliers from the data: This is the easiest way to treat outliers, but this method causes data loss. If there is no business explanation for a case, then it must be removed from the data.
- Take a log() transform: Computing the log of the data helps to remove the effect of outliers. Consider a simple set of values with an outlier in it: [1, 2, 3, 40, 33, 4, 50, 60, 5, 500]. If we take the log of this data, it becomes [0.0, 0.7, 1.1, 3.7, 3.5, 1.4, 3.9, 4.1, 1.6, 6.2], and the effect of the outliers is greatly reduced.
- Apart from log(), you can also use the square root, cube root, 1/n, etc., and select the transformation which best removes the effect of outliers.
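The log transform above is one line in NumPy; this reproduces the example values exactly:

```python
import numpy as np

# The same example values from above, with outliers like 500 in them.
values = np.array([1, 2, 3, 40, 33, 4, 50, 60, 5, 500], dtype=float)

# The natural log compresses large values far more than small ones.
logged = np.round(np.log(values), 1)
print(logged.tolist())
# [0.0, 0.7, 1.1, 3.7, 3.5, 1.4, 3.9, 4.1, 1.6, 6.2]
```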
Step-8: Splitting the data. 70% for Training and 30% for Testing by random sampling
First of all why random sampling?
Please refer to this post to understand the need for sampling in data science in detail.
We need to split the data into training and testing sets in order to test the performance of the predictive model by asking it unseen questions, i.e., giving it inputs from the unseen testing data after it has been trained on the training data.
Why 70:30? Can I select 80:20?
Yes! 70:30, 80:20, and 75:25 are all acceptable ratios of training and testing data. The idea is to select ENOUGH rows for the training data to cover all the types of patterns in the data for the model to learn, while the testing data also holds enough rows to test those learnings.
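A 70:30 random split is one call in scikit-learn; the toy arrays here are placeholders for real predictor and target data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)   # 50 rows of toy predictor data
y = np.arange(50)                   # toy target values

# random_state fixes the random shuffle so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)
print(len(X_train), len(X_test))    # 35 15
```

Changing `test_size` to 0.20 or 0.25 gives the 80:20 and 75:25 splits mentioned above.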
Step-9: Creating the model on Training data
70% of the data is called the Training set because it is used to ‘train‘ the algorithm.
Just like a child learns that A is for apple by identifying the shape, size, and color of the fruit: the next time an apple is shown, the child is able to recognize that it is an apple based on the historical learnings, by remembering the images of the fruit and comparing them with the new apple.
Similarly, machine learning algorithms learn by example.
“If the input is X then the output should be y”.
We show many such examples of X, y pairs to the machine learning algorithm to help it formulate a trend/logic. So, when a new X is input to the model it predicts the value of y.
The more examples we provide for learning, the better the predictive model becomes. This procedure is also known as supervised machine learning, since you are supervising the learning of the algorithm by showing it input and expected output examples.
Based on the Target Variable, either regression or classification model is created.
If the Target Variable is a continuous number, like sales, demand, supply, turnover, etc., then a regression model is created. Common supervised machine learning algorithms used are Linear Regression, Decision Trees, Random Forests, etc.
If the Target Variable is a category, like Yes/No, Class A/B/C, or Gold/Silver/Platinum, then a classification model is created. Common supervised machine learning algorithms used are Logistic Regression, Decision Trees, Random Forests, etc.
You should try all of these algorithms and note down the accuracy of each to decide which one is most suitable for the given data.
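Training a regression model on X, y example pairs is a few lines in scikit-learn. The demand/sales numbers below are synthetic (sales is exactly 0.9 × demand), chosen so the learned pattern is easy to verify:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy X, y pairs: Sales is exactly 0.9 * Demand (synthetic numbers).
demand = np.array([[100.0], [200.0], [300.0], [400.0], [500.0]])
sales = np.array([90.0, 180.0, 270.0, 360.0, 450.0])

model = LinearRegression()
model.fit(demand, sales)            # "if the input is X, the output is y"

# A new, unseen X: the model applies the learned pattern.
print(round(float(model.predict([[600.0]])[0])))  # 540
```

Swapping `LinearRegression` for `DecisionTreeRegressor` or `RandomForestRegressor` (or their classifier counterparts) follows the same fit/predict pattern.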
Step-10: Measuring the accuracy on Testing data
Once we have created the predictive model, it’s time to test how well it has learned from the examples in the Training data.
This can be done by comparing the predictions of the model with the original values.
The reason this comparison is done on the Testing data is that the model has not seen the Testing data examples. If you did this on the Training data, the accuracy of the model would be bound to be high, because all the training examples were already seen by the model. In fact, it was built using those training examples!
Measuring the accuracy on Testing data simulates the future scenario when the model is deployed in production. You get an idea about how the model will perform on live data which is new and unseen to it.
How exactly do you measure the accuracy?
Accuracy is measured by computing a few metrics. For regression and classification, it is done differently.
Regression: Using Mean Absolute Percent Error(MAPE) and Median Absolute Percent Error
Classification: Using F1-Score, Precision, Recall, and AUC.
These values are calculated for all models created by using different algorithms. A comparison between all values is done to see which algorithm is producing the best accuracy for the given data.
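As a small sketch of both kinds of metric, with hypothetical actual and predicted values:

```python
import numpy as np
from sklearn.metrics import f1_score

# Regression: Mean Absolute Percent Error (MAPE) on hypothetical numbers.
actual = np.array([100.0, 200.0, 400.0])
predicted = np.array([110.0, 180.0, 400.0])
mape = np.mean(np.abs((actual - predicted) / actual)) * 100
print(round(mape, 1))                      # 6.7

# Classification: F1-Score on hypothetical 0/1 labels.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
print(round(f1_score(y_true, y_pred), 2))  # 0.86
```

A MAPE of 6.7% roughly means the regression model is about 93% accurate on these values; the F1-Score balances precision and recall for the classifier.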
Step-11: n-fold Cross-Validation(Bootstrapping)
There is a chance that while randomly selecting records for the Training data, the selected records were the lucky ones with neat and clean patterns, and hence the accuracy of the model turned out to be high. Or maybe low, in the other extreme scenario.
In order to be sure that this accuracy is consistent, we select a random sample from the full data multiple times, train the model on the new Training data each time, and finally test the model on the Testing data.
If you repeat this process 5 times, it is called 5-fold Bootstrapping, and so on.
The final accuracy of the predictive model is the average of all accuracies recorded for each cross-validation step.
Hence, if the average comes out to, say, 91%, you can say the predictive model will perform with 91% accuracy on average when it is deployed.
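scikit-learn wraps the repeat-split-train-test loop in one call; a minimal sketch on synthetic data (the 100×3 predictor matrix and linear target are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# 5-fold cross-validation: split, train, and test 5 times, then average.
scores = cross_val_score(LinearRegression(), X, y, cv=5)
print(round(scores.mean(), 2))   # average score across the 5 folds
```

The mean of `scores` is the averaged accuracy described above; the spread of the individual fold scores tells you how consistent the model is across random samples.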
Step-12: Finding the importance of each predictor statistically
Which of the predictors used really affect the value of the Target Variable?
Statistically, this can be measured using variable importance charts. By looking at one, only the most important features can be selected for the final deployment.
Why choose only a few Predictors?
If you keep too many predictors in the machine learning model, it will depend on all of those factors, and hence the predictions coming out of the model will often be inaccurate.
The whole point of the machine learning process is to find those top factors which affect the target variable the most.
A very common method for producing a variable importance chart is available in the Random Forest algorithm. It is neat and easy to use, and works for both regression and classification use cases!
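A minimal sketch of Random Forest variable importance on synthetic data, where by construction only the first predictor drives the target:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
# Only the first predictor actually drives the target; the rest are noise.
y = 5.0 * X[:, 0] + rng.normal(scale=0.1, size=300)

rf = RandomForestRegressor(n_estimators=50, random_state=1).fit(X, y)
print(np.round(rf.feature_importances_, 2))  # the first feature dominates
```

Plotting `feature_importances_` as a bar chart gives the variable importance chart mentioned above; `RandomForestClassifier` exposes the same attribute for classification.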
Step-13: Train the predictive model on full data
When you are satisfied with the accuracy of the model, train it one more time on the full data (Training + Testing combined).
This is required because you should expose all the available types of patterns in the data to the predictive model, so that when any similar pattern is encountered in the live environment, the model will be able to predict the answers accurately.
Step-14: Deploy the predictive model in production
In order to deploy the model in production, you should follow the steps below:
- Save the model as a serialized file which can be stored anywhere
- Create a function which gets integrated with the front end (Tableau/Java website, etc.) to take all the inputs and return the Target Variable’s prediction
The first step is to save the model as a serialized file, because right now it is an object in RAM. If you need to share this model with someone, it has to be in file form so that it can be sent/placed anywhere.
Serialization means converting a memory object in RAM into a physical file
Once you have the serialized model, it can be placed on a server. Then any front-end application can access it whenever required, via a function.
This function is created as the last part of the deployment. It takes all the predictor values as input, passes them to the machine learning model, takes the prediction, and returns it to the front-end application.
For example, consider the flow below: a website is evaluating the price of your bike by asking questions like “How many years old is it?”, “How many kilometers has it run?”, and “What is its current mileage?”
- The click of the Predict button will call the Predict Price function
- The Predict Price function will access the serialized machine learning model placed on the server and pass the input to the predictive model
- The Predictive Model returns the predicted price
- The Predict Price function will send the predicted price to the front end
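The serialize-then-serve flow above can be sketched with Python's pickle module. The `predict_price` function name and the toy one-feature model are illustrative, not the bike site's actual code:

```python
import pickle
import numpy as np
from sklearn.linear_model import LinearRegression

# Train a toy model (y = 2x exactly), then serialize it to a file.
model = LinearRegression().fit(np.array([[1.0], [2.0], [3.0]]),
                               np.array([2.0, 4.0, 6.0]))
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)          # serialization: RAM object -> file

# The front end calls a small function like this (the name is hypothetical).
def predict_price(inputs):
    with open("model.pkl", "rb") as f:
        loaded = pickle.load(f)    # load the serialized model from disk
    return float(loaded.predict([inputs])[0])

print(round(predict_price([5.0]), 1))  # 10.0
```

In a real deployment the model file would sit on a server and this function would be exposed behind an API endpoint; joblib is a common alternative to pickle for scikit-learn models.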
The way you apply machine learning matters to the business; it does not matter how many algorithms you have mastered. If you look closely, there is always scope for improvement in any business, and that can be a potential use case for machine learning.