How to apply Data Science for any business problem

Understanding machine learning algorithms and statistics is half of the story.

Choosing the right algorithm for the business problem at hand is the real challenge.

A common question is: “Where can I use machine learning in my project?”

My answer is: “Understand the motive behind the project, then help that motive with data science.”

Corporations are spending millions in setting up the IT processes and BI in order to gain a certain competitive advantage by trying to understand their customer and their competitors.

If there is anything they want to forecast more accurately ahead of time, it is a potential use case for data science.

For example, an insurance company may want to forecast “Is the submitted claim fraudulent or not?” A telecom company may want to understand “Is this consumer going to stay with us or not?” A consumer packaged goods company may want to know “How many units of soap will we be able to sell next month?”

In all these cases, machine learning algorithms can help to predict future numbers by learning the patterns from historical data.

Having solved business problems in multiple domains using machine learning, I found a few steps to be common in every scenario. I list and discuss them here.

The idea is to use the steps below as a template to identify opportunities for data science in any given business domain and to apply the correct machine learning algorithm to them.

In this post, I will be focussing on supervised machine learning only. In a follow-up post, I will discuss the application of the unsupervised techniques.

14 Steps to apply data science for any business problem

1. Define the business problem
2. Identify the Target Variable in data
3. Choosing the appropriate type of machine learning
4. Remove useless variables from the data
5. Identify ‘potential’ Predictor Variables in data
6. Treatment of missing data in each one of the predictor variables
7. Treatment of outliers in data
8. Splitting the data. 70% for Training and 30% for Testing by random sampling
9. Creating the model on Training data
10. Measuring the accuracy on Testing data
11. Repeat steps 8 to 10 at least 5 times (known as Bootstrapping)
12. Finding the importance of each predictor statistically
13. Train the predictive model on full data
14. Deploy the predictive model in production

Let’s get started!

Step-1: Define the business problem

Whenever you solve a data science problem, always begin with the diagram below.

Figure: Define the business problem. What will the model predict?

This is arguably the most important step of your data science project, since it determines the flow of the solution and the choice of algorithms.

Ask yourself this question every time you do machine learning: “I need to create a model which will predict something. What is it?” This will help you find the reason why the model is required.

What exactly do you want to predict with the help of a machine learning model? Do you want to predict sales for the next quarter or year? Do you want to predict whether a submitted insurance claim is fraudulent?

Take the help of your clients or the business analysts to understand the needs.

Understand what kind of predictions will help your clients perform their business better.

Define the business problem statement like the examples below.

“The machine learning model will predict the volume of sales for every upcoming quarter.”

“The machine learning model will predict whether the submitted claim is fraudulent or not.”

“The machine learning model will predict whether the loan application is good or bad.”

“The machine learning model will predict the ideal valuation of a home.”

Step-2: Identify the target variable in data

Target Variable: That variable/column/feature in data which you want to predict.

Once you have defined the business problem clearly, the next step is to find the target variable in the data. Sometimes the target variable is obvious from its name. For example, column names like “sales”, “fraud”, “price”, and “turnover” are very easy to spot.

But sometimes the databases do not follow a friendly naming convention, so the same columns appear under more technical names. For example, sales may be stored under a name like “SAL_DB_CAPT”, which is not intuitive to spot at first glance. Hence, a business dictionary or a business analyst will help you locate the sales column, which is the target variable.

Figure: Identifying the Target Variable

The rule of thumb is: whenever in doubt, quickly take the help of your clients or a business analyst. Don’t assume anything unnecessarily!

Step-3: Choosing the appropriate type of machine learning

At this point, you know which business problem you are solving and what type of Target Variable you have.

Either you are predicting a continuous number like “Sales”, “Profit”, “Demand”, “Turnover”, “Volumes”, “Number of Visitors”, etc. This is Regression.

Or you are predicting a categorical/discrete value like “0/1”, “Yes/No”, “Good/Bad”, “Silver/Gold/Platinum”, etc. This is Classification.

Figure: Choosing the appropriate type of machine learning. Regression or Classification?

Regression and classification are the two big types of supervised machine learning used in the industry today. Almost every business domain has one or more applications of these two techniques.
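
The regression-or-classification decision can be sketched in a few lines of Python. This is only a rough heuristic under stated assumptions: the helper name `suggest_problem_type` and the `max_classes` cut-off of 10 are made up for illustration, and the final call should still come from the business definition of the target.

```python
import pandas as pd

def suggest_problem_type(target: pd.Series, max_classes: int = 10) -> str:
    """Suggest 'classification' or 'regression' for a target column."""
    # Boolean, text and categorical targets are class labels.
    if pd.api.types.is_bool_dtype(target) or not pd.api.types.is_numeric_dtype(target):
        return "classification"
    # Numeric targets with only a handful of distinct values (0/1, 1/2/3)
    # are usually encoded categories rather than measurements.
    # The cut-off of 10 is an illustrative assumption.
    if target.nunique(dropna=True) <= max_classes:
        return "classification"
    return "regression"

sales = pd.Series([120.5, 98.0, 143.2, 110.7, 101.3, 99.9,
                   150.0, 87.4, 132.1, 125.6, 91.0, 140.8])
fraud = pd.Series(["Yes", "No", "No", "Yes"])
loan_flag = pd.Series([0, 1, 1, 0, 1])

for name, column in [("sales", sales), ("fraud", fraud), ("loan_flag", loan_flag)]:
    print(name, "->", suggest_problem_type(column))
```

A numeric code such as a store number would also be reported as regression by this sketch if it had many distinct values, which is one more reason to confirm the target with the business side.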

Step-4: Remove Useless Variables from data

A few types of columns/variables are simply useless for the machine learning algorithm because they do not hold any patterns with respect to the target variable.

For example, an ID column. It is useless because every row has a unique ID, so it does not have any patterns in it. It is wrong to say that if the ID is 1002 then the sales will be 20,000 units. However, if you pass this variable as a predictor to the machine learning algorithm, it will try to use the ID values to predict sales and come up with some garbage logic!
The rule of Garbage in, Garbage out applies!

Does that mean every column which has unique values for each row is useless?

Here, business knowledge is very critical. Every continuous numeric column has a unique value for almost every row, like Age, Sales, Demand, etc. But some of these are important from the business perspective. These columns contain patterns in them and that is what the machine learning algorithm needs to find out, for example, when the Demand number is 500 units, then the sales are seen at 450 units.

Business domain knowledge helps in distinguishing useful and useless columns

A few more types of columns are also useless, like row IDs, phone numbers, complete addresses, comments, descriptions, dates, etc. These are not useful because each row contains a unique value, so there will not be any patterns.

Some of these can be used indirectly by deriving certain parts from them. For example, dates cannot be used directly, but the derived month, quarter, or week can be used because these may hold some patterns with respect to the target variable. This process of creating new columns from existing columns is known as Feature Engineering.

The rule of thumb is: remove those columns which do not exhibit any patterns with respect to the target variable. Do this only when you are absolutely sure; otherwise, give the column a chance.
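
As a small sketch of this step, the pandas snippet below flags columns whose values are unique in every row, derives month and quarter from a date, and then drops the columns judged useless. The table and all of its column names are invented for illustration.

```python
import pandas as pd

# Hypothetical data; every column name here is an assumption.
df = pd.DataFrame({
    "row_id": [1001, 1002, 1003, 1004],
    "phone": ["555-0101", "555-0102", "555-0103", "555-0104"],
    "order_date": pd.to_datetime(["2023-01-15", "2023-04-02",
                                  "2023-04-20", "2023-11-30"]),
    "demand": [500, 320, 410, 610],
    "sales": [450, 300, 395, 580],
})

# Columns with a different value in every row are only *candidates* for removal.
# Note that "demand" and "sales" are flagged too: uniqueness alone is not proof.
all_unique = [c for c in df.columns if df[c].nunique() == len(df)]
print("Unique in every row:", all_unique)

# Feature engineering: a raw date is unique per row, but its parts repeat.
df["order_month"] = df["order_date"].dt.month
df["order_quarter"] = df["order_date"].dt.quarter

# The final drop list is a manual, business-driven decision.
useless = ["row_id", "phone", "order_date"]
model_df = df.drop(columns=useless)
print(model_df.columns.tolist())
```

The flagged list deliberately includes useful numeric columns, which is why the drop list is written by hand rather than generated automatically.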

Step-5: Identify ‘potential’ Predictor Variables in data

This step is the trickiest!

Every business problem is driven by certain factors. For example, the sales of a shampoo brand are driven by many factors, such as the ones listed below.

How good the product is on a scale of 1 to 10?
How well it is promoted in the market?
How many retail stores it is available in?
Is it available in online stores or not?
Has any movie star endorsed it on TV?
What were sales last year?
How many people purchase it out of loyalty?

The list can go on and on.

The important thing is to identify which of these factors really affect sales (the Target Variable).

All the factors which influence the target variable are also known as predictors, because they might help to predict the value of the Target Variable in the future.

The next thing you may wonder is: where do I begin? How will I know which factors affect the target variable?

This can be solved easily by using a methodical approach listed below:

Talk to business analysts/clients to understand how they guess or judge the value of the target variable when it is required. For example: How do they estimate sales? What factors do they consider while planning the supply of goods? These discussions will provide a set of variables which are important from the business perspective.
Explore each variable/column based on how its values are distributed. As a rough guide, a good numeric variable often follows something close to the normal distribution shown below: the count of low values is small, the count of medium values is the highest, and the count of high values is small again.

Figure: Standard Normal Distribution (The Bell Curve)

To see the distribution of a variable, use a histogram (for continuous numeric variables) or a bar chart (for discrete categorical variables).
The idea is to see how the values in a column are distributed. If the distribution is very far from the ideal bell curve, that column may not be useful.
Whenever you are in doubt about a column, give it a chance and measure its importance later using the variable importance charts discussed in Step-12.

Figure: Histogram vs Bar Chart. When to use each to explore the data distribution
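
The numbers behind a histogram or a bar chart can be computed without plotting anything. The sketch below is one possible way to do it; the helper `describe_distribution`, the choice of 5 bins, and the sample columns are all illustrative assumptions.

```python
import numpy as np
import pandas as pd

def describe_distribution(col: pd.Series, bins: int = 5) -> pd.Series:
    """Histogram counts for numeric columns, category counts otherwise."""
    if pd.api.types.is_numeric_dtype(col):
        # Histogram: split the value range into equal-width bins and count rows.
        counts, edges = np.histogram(col.dropna(), bins=bins)
        labels = [f"{lo:.1f} to {hi:.1f}" for lo, hi in zip(edges[:-1], edges[1:])]
        return pd.Series(counts, index=labels)
    # Bar chart: one bar per category, height = number of rows.
    return col.value_counts()

# Hypothetical columns for illustration.
age = pd.Series([22, 25, 31, 33, 35, 36, 38, 41, 44, 52, 58])
channel = pd.Series(["online", "store", "store", "online", "store"])

print(describe_distribution(age))
print(describe_distribution(channel))
```

If matplotlib is installed, `age.plot.hist()` and `channel.value_counts().plot.bar()` should draw the corresponding charts from the same data.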

Step-6: Treatment of missing data in each one of the predictor variables

Before passing the data to machine learning algorithms, it is mandatory to remove/replace every missing value.

This is required because many machine learning algorithms cannot handle missing values at all, and those that can may produce biased results.

Missing value treatment is performed separately for each column.

This is also known as Missing data imputation.

How do I replace missing values in data?

There is a neat flow of steps you can follow while treating missing values in any data. These steps are listed below.

If more than 30% of the values in a column are missing, remove that column from the data and don’t use it for machine learning. Replacing 30% of the values with the median would introduce too much bias in the data and affect the results produced by the predictive model.
If n(missing) << n(total rows), delete the rows with missing values. For example, if there are 50,000 rows in the data and only 10 rows have missing values, we can safely delete those 10 rows to save time.
If the datatype is Quantitative (Continuous Numeric):
- Impute missing values with the median value of that column
- Another option is to impute missing values by generating a logical value based on other columns/business rules
If the datatype is Qualitative (Categorical):
- Impute missing values with the mode (the most frequently occurring value) of that column
- Another option is to impute missing values by generating a logical value based on other columns/business rules
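
The flow above can be sketched as a single pandas function. Treat it as a minimal illustration, not a definitive implementation: the function name `treat_missing` is made up, and the 1% cut-off used to decide that "very few" rows have gaps is an assumption, since the text only says n(missing) << n(total rows).

```python
import pandas as pd

def treat_missing(df: pd.DataFrame,
                  max_missing_share: float = 0.30,
                  max_row_share: float = 0.01) -> pd.DataFrame:
    out = df.copy()

    # Rule 1: drop columns with more than 30% missing values.
    share = out.isna().mean()
    out = out.drop(columns=share[share > max_missing_share].index)

    # Rule 2: if only a tiny share of rows has gaps, delete those rows.
    # The 1% threshold is an illustrative assumption.
    rows_with_gaps = out.isna().any(axis=1)
    if rows_with_gaps.mean() <= max_row_share:
        return out[~rows_with_gaps]

    # Rules 3 and 4: median for numeric columns, mode for categorical ones.
    for name in out.columns[out.isna().any()]:
        if pd.api.types.is_numeric_dtype(out[name]):
            out[name] = out[name].fillna(out[name].median())
        else:
            out[name] = out[name].fillna(out[name].mode().iloc[0])
    return out

# Hypothetical data for illustration.
raw = pd.DataFrame({
    "age": [25, None, 40, 35, 30],
    "city": ["Pune", "Delhi", None, "Pune", "Pune"],
    "notes": [None, None, "vip", None, None],
})
clean = treat_missing(raw)
print(clean)
```

Here `notes` is dropped because 80% of it is missing, the gap in `age` is filled with the median, and the gap in `city` is filled with the mode. Imputing from business rules is not covered by this sketch.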

Step-7: Treatment of outliers in data

What are outliers?

Outliers are values which do not follow the trend in the data. A simple example is shown below.

100 is an outlier in the list of values [2, 3, 4, 3, 4, 5, 3, 4, 100]. Since most of the values are single digits, 100 is an outlier relative to them.

In short, a value which is abnormally high or low compared to the rest of the group is an outlier.

Outliers affect predictive modeling very badly.

If you pass data containing outliers to a machine learning algorithm, the algorithm has to include those outliers in its logic. This introduces errors in the model.

How to treat outliers?

Remove outliers from data: This is the easiest way to treat outliers, but it causes data loss. If there is no business explanation for an outlying case, it should be removed from the data.
Take the log() transform: Computing the log of the data helps to reduce the effect of outliers. Consider a simple set of values with an outlier in it: [1, 2, 3, 40, 33, 4, 50, 60, 5, 500]. If we take the natural log of this data, it becomes [0.0, 0.7, 1.1, 3.7, 3.5, 1.4, 3.9, 4.1, 1.6, 6.2] and the effect of the outlier is greatly reduced. This is clearly visible in the diagram below.
Apart from log(), you can also use the square root, cube root, 1/n, etc., and select the transformation which best reduces the effect of outliers.

Effect of log Transformation on Outliers
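
The log example above can be reproduced with NumPy. The sketch also flags the outlier with the common 1.5 × IQR rule; that rule is not part of this post and is used here only as one illustrative way to detect outliers.

```python
import numpy as np

values = np.array([1, 2, 3, 40, 33, 4, 50, 60, 5, 500], dtype=float)

def iqr_outliers(x: np.ndarray, k: float = 1.5) -> np.ndarray:
    """Boolean mask of points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

mask = iqr_outliers(values)
print("Flagged as outliers:", values[mask])

# Option 1: remove the flagged rows (causes data loss).
trimmed = values[~mask]

# Option 2: keep every row but compress the scale with the natural log.
logged = np.round(np.log(values), 1)
print("After log():", logged)
```

Note that `np.log` needs strictly positive inputs, so columns containing zeros or negative numbers would need a different transformation.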

Step-8: Splitting the data. 70% for Training and 30% for Testing by random sampling

First of all, why random sampling?

Please refer to this post to understand the need for sampling in data science in detail.

We need to split data into training and testing sets in order to test the performance of the predictive model by asking it unseen questions i.e. giving input from unseen testing data after it has been trained using the training data.

Why 70:30? Can I select 80:20?

Yes! 70:30, 80:20, and 75:25 are all acceptable ratios of training to testing data. The idea is to select ENOUGH rows for the training data to cover all the types of patterns the model needs to learn, while the testing data still holds enough rows to test what was learned.
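
A 70:30 random split can be sketched with scikit-learn's `train_test_split`. The data below is randomly generated purely for illustration, and the `random_state` values are arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for real data: 100 rows, 3 predictors, 1 target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.normal(size=100)

# test_size=0.30 keeps 30% of the rows for testing; rows are shuffled
# before splitting, and random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

print("Training rows:", len(X_train))
print("Testing rows:", len(X_test))
```

Changing `test_size` to 0.20 or 0.25 gives the 80:20 and 75:25 splits mentioned above.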

Step-9: Creating the model on Training data

70% of the data is called the Training set because it is used to ‘train’ the algorithm.

Think of a child who learns that A is for apple and associates the word with the shape, size, and color of the fruit. The next time an apple is shown, the child recognizes it based on what was learned earlier, by remembering the images of the fruit and comparing them with the new apple.

Similarly, machine learning algorithms learn by example.

“If the input is X then the output should be y”.

We show many such examples of X, y pairs to the machine learning algorithm to help it formulate a trend/logic. So, when a new X is input to the model it predicts the value of y.

The more examples we provide to learn from, the better the predictive model becomes. This procedure is known as supervised machine learning, since you are supervising the learning of the algorithm by showing it examples of inputs and expected outputs.

Based on the Target Variable, either a regression or a classification model is created.

If the Target Variable is a continuous number, like sales, demand, supply, turnover, etc., then a regression model is created. Commonly used supervised machine learning algorithms are Linear Regression, Decision Trees, Random Forests, XGBoost, etc.

If the Target Variable is a category, like Yes/No, Class A/B/C, gold/silver/platinum, etc., then a classification model is created. Commonly used supervised machine learning algorithms are Logistic Regression, Decision Trees, Random Forests, XGBoost, K-Nearest Neighbours, Support Vector Machines, etc.

Try all the algorithms and note down the accuracy for each to decide which one of them is most suitable for the given data.
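
Trying several algorithms on the same split can be sketched as a simple loop. This example uses scikit-learn on a synthetic classification dataset, with three of the algorithms named above and F1-score as the comparison metric; the dataset and all parameter values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic Yes/No style problem, used only for illustration.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0
)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)                 # learn from Training data
    predictions = model.predict(X_test)         # predict on unseen Testing data
    scores[name] = f1_score(y_test, predictions)

for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: F1 = {score:.3f}")

best = max(scores, key=scores.get)
print("Best on this split:", best)
```

The winner on a single split may change with a different random sample, so the ranking is best confirmed over several splits.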

Supervised Machine Learning: Regression Vs Classification

Step-10: Measuring the accuracy on Testing data

Once we have created the predictive model, it’s time to test how well it has learned from the examples in the Training data.

This can be done by comparing the predictions of the model with the original values.

This comparison is done on the Testing data because the model has not seen those examples. If you do it on the Training data, the accuracy of the model is bound to be high because the model has already seen all the training examples. In fact, it was built using them!

Measuring the accuracy on Testing data simulates the future scenario when the model is deployed in production. You get an idea about how the model will perform on live data which is new and unseen to it.

How exactly do you measure the accuracy?

Accuracy is measured by computing a few metrics. For regression and classification, it is done differently.

Regression: Using Mean Absolute Percentage Error (MAPE) and Median Absolute Percentage Error

Classification: Using F1-Score, Precision, Recall, and AUC.

Accuracy Measurement of Predictive Models

These values are calculated for all models created by using different algorithms. A comparison between all values is done to see which algorithm is producing the best accuracy for the given data.
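
To make these metrics concrete, the sketch below computes MAPE, Median APE, precision, recall, and F1-score by hand with NumPy. The helper names and the tiny example arrays are invented for illustration; AUC is left out because it needs predicted probabilities rather than labels.

```python
import numpy as np

def absolute_percent_errors(actual, predicted):
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    # Undefined when an actual value is 0, so such rows need special handling.
    return np.abs((actual - predicted) / actual) * 100

def mape(actual, predicted):
    return float(np.mean(absolute_percent_errors(actual, predicted)))

def median_ape(actual, predicted):
    return float(np.median(absolute_percent_errors(actual, predicted)))

def precision_recall_f1(actual, predicted, positive=1):
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    tp = np.sum((predicted == positive) & (actual == positive))
    fp = np.sum((predicted == positive) & (actual != positive))
    fn = np.sum((predicted != positive) & (actual == positive))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return float(precision), float(recall), float(f1)

# Regression example: errors of 10%, 10% and 0%.
sales_actual, sales_predicted = [100, 200, 400], [110, 180, 400]
print("MAPE:", round(mape(sales_actual, sales_predicted), 2))
print("Median APE:", median_ape(sales_actual, sales_predicted))

# Classification example: 3 true positives, 1 false positive, 1 false negative.
fraud_actual, fraud_predicted = [1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 1, 1]
print("Precision, Recall, F1:", precision_recall_f1(fraud_actual, fraud_predicted))
```

A common way to turn MAPE into an accuracy figure is 100 − MAPE, so a MAPE of about 6.67 corresponds to roughly 93% accuracy.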

Step-11: n-fold Cross-Validation (Bootstrapping)

There is a chance that, while randomly selecting records for the Training data, the selected records were the lucky ones with neat and clean patterns, so the accuracy of the model turned out to be high. In the other extreme, an unlucky sample can make it look low.

To be sure that the accuracy is consistent, we draw a random sample from the full data multiple times. Each time, we train the model on the new Training data and test it on the new Testing data.

If you repeat this process 5 times, it is called 5-fold bootstrapping here, and so on. Strictly speaking, repeated random splitting is known as repeated random sub-sampling (Monte Carlo cross-validation); classic k-fold cross-validation instead divides the data into k non-overlapping parts and uses each part once for testing.

Figure: N-fold Cross-Validation (N-fold Bootstrapping)

The final accuracy of the predictive model is the average of all the accuracies recorded across the repetitions.

Hence, if the average works out to 91%, for example, you can say the predictive model will perform with about 91% accuracy on average when it is deployed.
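
Repeating the random split, training, and testing five times can be sketched with scikit-learn's `ShuffleSplit` and `cross_val_score`. The synthetic regression data, the linear model, and the R² score are illustrative choices, not the only options.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Synthetic regression data, used only for illustration.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# Five independent random 70:30 splits of the full data.
splitter = ShuffleSplit(n_splits=5, test_size=0.30, random_state=0)

# For each split: fit on the training part, score on the testing part.
scores = cross_val_score(LinearRegression(), X, y, cv=splitter, scoring="r2")

print("Score per repetition:", scores.round(3))
print("Average score:", round(scores.mean(), 3))
print("Spread (std):", round(scores.std(), 3))
```

A small spread suggests the accuracy is stable across samples. Passing `cv=5` instead of the splitter would run classic 5-fold cross-validation.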

Step-12: Finding the importance of each predictor statistically

Which of the used predictors is really affecting the value of Target Variable?

Statistically, this can be measured by using variable importance charts. By looking at such a chart, only the most important features can be selected for final deployment.

Why choose only a few Predictors?

If you keep many predictors in the machine learning model, it depends on all of those factors. Weak predictors add noise, which makes the predictions less reliable, and every one of them must be collected before a prediction can be made.

The whole machine learning process is about finding the top factors which affect the target variable most.

A very common way to produce a variable importance chart comes with the Random Forest algorithm. It is neat and easy to use, and it works for both regression and classification use cases.
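
As a sketch of the Random Forest approach, the snippet below fits scikit-learn's `RandomForestClassifier` on synthetic data and ranks the predictors by `feature_importances_`. The data is generated so that only the first two columns carry signal; the `feature_0` to `feature_5` names are placeholders.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: with shuffle=False the 2 informative columns come first,
# and the remaining 4 columns are pure noise.
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)
names = [f"feature_{i}" for i in range(X.shape[1])]

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Importances sum to 1; larger values mean the predictor was more useful.
importance = pd.Series(model.feature_importances_, index=names)
importance = importance.sort_values(ascending=False)
print(importance.round(3))
```

With matplotlib installed, `importance.plot.barh()` should draw the chart. For regression problems `RandomForestRegressor` exposes the same `feature_importances_` attribute.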

Step-13: Train the predictive model on full data

When you are satisfied with the accuracy of the model and the final set of predictors has been selected, it is time to train the predictive model using all the available data.

This is required because you should expose the predictive model to every available type of pattern in the data, so that when a similar pattern is encountered in the live environment, the model is able to predict the answer accurately.

Step-14: Deploy the predictive model in production

In order to deploy the model in production, follow the steps below.

Save the model as a serialized file which can be stored anywhere
Create a function which gets integrated with the front end (Tableau, a Java website, etc.) to take all the inputs and return the predicted value of the Target Variable

The first step is to save the model as a serialized file because, right now, it is an object in RAM. If you need to share this model with someone, it has to be in file form so that it can be sent or placed anywhere.

Serialization means converting an in-memory object into a physical file

Once you have the serialized model, it can be placed on a server. Then any front-end application can access it whenever required by calling a function.

This function needs to be created as the last part of the deployment. The function takes all the predictor values as input and passes them to the predictive model. It then takes the prediction and returns it to the front-end application.

For example, consider the flow below. A website evaluates the price of your bike by asking questions like: How many years old is it? How many kilometres has it run? What is its current average?

The click of the Predict button will call the Predict Price function
The Predict Price function will access the serialized machine learning model placed on the server and pass the input to the predictive model
The Predictive Model returns the predicted price
The Predict Price function will send the predicted price to the front end
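
The bike-price flow can be sketched end to end with Python's built-in `pickle` module. Everything specific here is an assumption made for illustration: the four training rows are made up, "average" is taken to mean fuel efficiency, and the file name and the `predict_price` signature are hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up bike data: [age in years, kilometres run, average].
X = np.array([[1, 5000, 45], [3, 20000, 40],
              [5, 40000, 35], [8, 70000, 30]], dtype=float)
y = np.array([60000, 45000, 30000, 15000], dtype=float)

# Step 13: train the final model on all available rows.
model = LinearRegression().fit(X, y)

# Step 14a: serialize the in-memory model object to a file.
MODEL_PATH = os.path.join(tempfile.gettempdir(), "bike_price_model.pkl")
with open(MODEL_PATH, "wb") as fh:
    pickle.dump(model, fh)

# Step 14b: the function a front end would call on "Predict".
def predict_price(years, kms, average, model_path=MODEL_PATH):
    with open(model_path, "rb") as fh:
        loaded = pickle.load(fh)   # only unpickle files you trust
    features = np.array([[years, kms, average]], dtype=float)
    return float(loaded.predict(features)[0])

print("Predicted price:", round(predict_price(3, 20000, 40)))
```

In a real service the model would usually be loaded once at start-up rather than on every call, and the function would sit behind a web endpoint that the front end can reach.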

Conclusion:

The way you apply machine learning matters to the business. It does not matter how many algorithms you have mastered. If you look closely there is always a scope for improvement in any business and it can be a potential use case for machine learning.

Author Details

Farukh Hashmi

Lead Data Scientist

Farukh is an innovator in solving industry problems using Artificial intelligence. His expertise is backed with 10 years of industry experience. Being a senior data scientist he is responsible for designing the AI/ML solution to provide maximum gains for the clients. As a thought leader, his focus is on solving the key business problems of the CPG Industry. He has worked across different domains like Telecom, Insurance, and Logistics. He has worked with global tech leaders including Infosys, IBM, and Persistent systems. His passion to teach inspired him to create this website!

https://thinkingneuron.com/

thinkingneuron@gmail.com
