Time Series Forecasting: A Case Study in Python for Superstore Dataset

Time Series Case Study in Python

The Time Series is one of the old gems of Machine Learning. This is a univariate technique that becomes applicable when you have only one column of data available.

If there are multiple columns available then the preferred approach is to use Supervised ML algorithms like Random Forest, XGboost etc. Because more predictor columns will bring more information and hence the predictions will be better.

However, ideal data for supervised ML is not available most of the time and this is when Time Series shines! With just one column of historical data on sales, demand, or any value you want to predict, you can generate reliable predictions.

Take a look at the below data. It will help to visualise the above concept.

Installing the xlrd library to read data

xlrd library is required if you are reading xls files using pandas.

time series forecasting

Reading the Data

You can download the data for this case study here.

time series dataset sample output

Looking at the data summary

time series dataset summary

Problem Statement

The Superstore management contacted you to look into their sales data and create a predictive model that can tell the sales quantity expected in the next month for each of the categories of products ie. Furniture, Office Supplies and Technology.

Observing the Quantitative, Qualitative and Categorical variables in data

Time Series Data analysis

In this data, two variables can qualify for time series forecasting. Sales and Quantity.

Sales is the price value and can be derived using Quantity multiplied by the costs associated with products. Hence, Quantity is the raw or base variable and should be selected for time series modelling. There is no harm in selecting Sales as well, however, if you have a choice to model one variable out of the two, I will recommend Quantity as it is not a derived variable.

Now this dataset is not in the format that we can feed directly to time series. Some pre-processing will be required. The time series algorithm expects ONE column of data that we want to forecast further. Since the problem statement says to generate monthly predictions, we will aggregate the Quantity numbers monthly.

Feature Engineering in data

This data does not contain any Month or Year field, but it contains ‘Order date’. We will use that and generate Year and Month columns which can be used for our aggregation.

Time Series Feature Engineering

Observing the unique Months and years of data present. For time series modelling, at least 2-cycles of data should be present. In this case, we want to predict monthly sales, hence one year will be one complete cycle because all the months from Jan to Dec have appeared in the data and now the months are repeating. So, ideally, at least 2-year worth of data will be needed.

Looking at the below output, we can see we have 4 years of data present, hence, it will be sufficient for the Modelling.

Time Series Feature Engineering Month and Year

Aggregating Sales Quantity for each month

Preparing the data for time series modelling. Since the problem statement is to do the prediction monthly, you need to aggregate the data month wise.

Time Series Monthly aggregation of data

The above data is in a matrix format and Time Series expect data in one single column. So by using melt() function we can arrange the data year-month wise in a single column.

Visualizing the Total sales Quantity per month

Time Series monthly data plot

Looking at the values present inside SalesQuantity variable that we plotted above

SalesQuantity values in Time Series

Each of these numbers is a combination of three factors listed below

  1. Trend component: This determines the trend of the time series. Whether the overall trend is increasing or decreasing or constant.
  2. Seasonality component: This determines the cyclic nature of the data. e.g Ice cream sales will peak in summer and decline in winter. This cycle happens every year due to changing seasons, hence we know when to expect high sales and when to expect low sales every year.
  3. Residue component: This is also known as the error part of the time series that you cannot explain. Something random happens and increases the sales of an item at an unexpected time of the year. e.g Hand Sanitizer sales increased when Covid-19 broke.

Decomposition of Time Series Sales Data

Introduction to the SARIMA Time Series algorithm

In this case study I am using the seasonal SARIMA (Seasonal ARIMA). It performs better than ARIMA because it models the seasonal patterns separately hence providing more accurate predictions. In the SARIMAX() function below, you can notice two sets of order parameters. order and seasonal_order both take their own (p,d,q) parameters.

What are these (p,d,q) parameters in ARIMA time series?

ARIMA is a short form of Auto-Regressive (p) Integrated(d) Moving Average(q) models.

ARIMA Time Series parameters

  • p: How many Auto Regressive(AR) components to keep in the ARIMA Time Series equation. eg. ARIMA(2,0,0) means there will be two AR terms in the equation as shown below.

  • d: What is the order of differencing in the time series? e.g ARIMA(2,1,0) means there will be 2 AR terms that are differenced at order-1 in the Time Series equation as shown below.

ARIMA Time Series Equation

  • q: How many Moving Average(MA) components should be kept in the ARIMA Time Series equation? e.g ARIMA(2,1,3) will have 2 AR components,1st order differencing and 3 MA components in the equation as shown in the below diagram.

ARIMA Time Series Equation

When we have the SARIMA model then there are more components in the equation as compared to ARIMA which handles the seasonal part separately. For example, the equation of SARIMA(order=(2,1,3), seasonal_order=(2,0,3)) will look like below.

SARIMA Time Series Equation

How to find the best values of p,d,q?

The best values of p,d,q parameters can be found by trying out multiple values and measuring the accuracy for each combination. The combination that generates the highest accuracy is the best one. This process is also known as hyperparameter tuning!

Since we are using SARIMA, here we need to tune 6 parameters! (p,d,q) for order and (p,d,q) for seasonal_order

Function to find best p,d,q parameters

The below function finds the best p,d,q parameters for the SARIMA model.

SARIMA Time Series hyperparameters tuning results

Based on the above result we can see the best values for order(p,d,q)= (1,0,10) and for seasonal_order(p,d,q) = (2,0,0). The Cycle=12 is constant because this is monthly data.

Using these best hyperparameters and creating the Time Series model below.

Time Series forecast accuracy

The above forecast is for the unit sales of ALL the categories in the next 6 months. The training accuracy is 79%.

For how many future months can I do the predictions using Time Series?

You can do the predictions for any number of months for example the next 3 months, 11 months, 24 months etc. However, the recommendation is not more than 6-months. The reason is that we will have many data changes every month, hence ideally the model should be retrained every month with the latest data and then the future predictions should be revised.

How can I improve the predictions?

The first logical step is to bring more data. Especially since this is a univariate modelling technique, the more number of years worth of data the better it is!

If you don’t have additional data, then drilling down into categories/sub-categories may help to capture the data patterns better. For example, we have created a Time series above that captures the sales patterns for all the categories put together. If we create one Time series for one category then it will do better predictions.

In the below aggregation, you can observe the data at the category level. Based on the total sales we can prioritize which category’s Time Series should be created first.

You can also do data transformation like Square root, log of the SalesQuantity variable to improve the model fit accuracy and the prediction accuracy.

Time Series category wise data agggregation

Time Series only for the Office Supplies category

Office Supply Time Series Data

Visualizing the Sales data for the Office Supplies Category

Once we have filtered the data only for the Office Supplies category, you can observe slightly different trends in the data, and that makes sense because now you are focussing on the sales of one particular category.

Time Series data for Office supplies Category

Taking Log of SalesQuantity

Taking a log is one of the top data transformations for continuous variables in Machine Learning while trying to improve the model fit, just make sure to check there are no zero values in the data before taking LOG. Here, taking the log of the Sales Quantity column will improve the model fit as well as the accuracy.

Log of Time Series Data

Notice the y-axis here, the values are now small due to the log transformation, however, the data pattern is exactly the same as it was before taking the log.

Finding best hyperparameters(p,d,q) for SARIMA model

Calling the same function that we defined above to find the best hyperparameters for SARIMA model of the Office supplies category.

Hyperparameter tuning for SARIMA model

Notice that the accuracy numbers are very high after the log transformation! However, it is important to understand this is not the final accuracy because these numbers were achieved on a transformed data. You will have to convert the data to original scale before computing accuracy as shown in the below code.

Creating the Time Series for Office Supplies Category

Creating Time Series only for the Office supplies Category using the best hyperparameters found in the above step as order = (10, 0, 5) and
seasonal_order =(2, 0, 3)). The cycle remains 12 as this is a monthly Time Series.

Time Series forecast only for Office Supplies category

Conclusion

I hope this post taught you Time Series Modelling from a different perspective!

Keep in mind that choosing the right algorithm for a business problem is half the battle won! In this case study, we chose Time Series because we had the business problem of predicting the Sales Quantity for the next few months and in the given data most of the columns were Qualitative except for Sales and Quantity. Since Sales is derived using Quantity, hence we used Quantity as the variable to fit Time Series because its a RAW variable and its preferred over a derived variable. As a good exercise, you can try the same template using the Sales variable and see if the accuracy increases!

Thank you for your time on this website! If you found it useful, please share it with your friends to spread the knowledge!

Author Details
Lead Data Scientist
Farukh is an innovator in solving industry problems using Artificial intelligence. His expertise is backed with 10 years of industry experience. Being a senior data scientist he is responsible for designing the AI/ML solution to provide maximum gains for the clients. As a thought leader, his focus is on solving the key business problems of the CPG Industry. He has worked across different domains like Telecom, Insurance, and Logistics. He has worked with global tech leaders including Infosys, IBM, and Persistent systems. His passion to teach inspired him to create this website!

3 thoughts on “Time Series Forecasting: A Case Study in Python for Superstore Dataset”

  1. Hello Faruk,
    Can you please upload a video on Time series? t would really help to understand Time Series more accurately from your explanation.

Leave a Reply!

Your email address will not be published. Required fields are marked *