Time Series forecasting is one of the old gems of Machine Learning. It is a univariate technique that becomes applicable when you have only one column of data available.

If multiple columns are available, then the preferred approach is to use supervised ML algorithms like Random Forest, XGBoost, etc., because more predictor columns bring more information, and hence the predictions will be better.

However, ideal data for supervised ML is not available most of the time and this is when Time Series shines! With just one column of historical data on sales, demand, or any value you want to predict, you can generate reliable predictions.

Take a look at the below data. It will help to visualise the above concept.

**Installing the xlrd library to read data**

The xlrd library is required if you are reading .xls files using pandas.

```python
!pip install xlrd
```

**Reading the Data**

You can download the data for this case study here.

```python
import pandas as pd
import numpy as np

# Suppressing scientific notation
np.set_printoptions(suppress=True)

StoreSalesData=pd.read_excel('/Users/farukh/Downloads/Super Store Sales data.xls')
StoreSalesData.head()
```

**Looking at the data summary**

```python
StoreSalesData.info()
```

**Problem Statement**

The Superstore management contacted you to look into their sales data and create a predictive model that can tell the sales quantity expected in the next month for each of the categories of products, i.e. Furniture, Office Supplies, and Technology.

**Observing the Quantitative, Qualitative and Categorical variables in data**

```python
StoreSalesData.nunique()
```

In this data, two variables qualify for time series forecasting: Sales and Quantity.

Sales is the price value and can be derived by multiplying Quantity with the costs associated with the products. Hence, Quantity is the raw or base variable and should be selected for time series modelling. There is no harm in selecting Sales as well; however, if you have a choice to model only one of the two variables, I recommend Quantity, as it is not a derived variable.

Now, this dataset is not in a format that we can feed directly to a time series algorithm; some pre-processing is required. The time series algorithm expects ONE column of data that we want to forecast further. Since the problem statement asks for monthly predictions, we will aggregate the Quantity numbers monthly.

**Feature Engineering in data**

This data does not contain any Month or Year field, but it contains 'Order Date'. We will use that to generate Year and Month columns, which can then be used for our aggregation.

```python
# Function to get month from a date
def Function_get_month(inpDate):
    return(inpDate.month)

# Function to get Year from a date
def Function_get_year(inpDate):
    return(inpDate.year)

# Creating new columns
StoreSalesData['Month']=StoreSalesData['Order Date'].apply(Function_get_month)
StoreSalesData['Year']=StoreSalesData['Order Date'].apply(Function_get_year)

StoreSalesData.head()
```

Observing the unique months and years present in the data. For time series modelling, at least two cycles of data should be present. In this case, we want to predict monthly sales, hence one year will be one complete cycle, because all the months from Jan to Dec appear in the data and then the months repeat. So, ideally, at least two years' worth of data will be needed.

Looking at the below output, we can see we have 4 years of data present; hence, it will be sufficient for the modelling.

```python
# Checking unique values in Year and Month Columns
print("Unique Values in Year Column: ", StoreSalesData['Year'].sort_values().unique())
print("Unique Values in Month Column: ", StoreSalesData['Month'].sort_values().unique())
```

**Aggregating Sales Quantity for each month**

Preparing the data for time series modelling. Since the problem statement asks for monthly predictions, you need to aggregate the data month-wise.

```python
# Aggregating the sales quantity for each month for all categories
pd.crosstab(columns=StoreSalesData['Month'],
            index=StoreSalesData['Year'],
            values=StoreSalesData['Quantity'],
            aggfunc='sum')
```

The above data is in a matrix format, while the time series algorithm expects data in one single column. So, by using the melt() function, we can arrange the data year-month-wise in a single column.

```python
# Converting the crosstab data into one single column for Time Series
pd.crosstab(columns=StoreSalesData['Year'],
            index=StoreSalesData['Month'],
            values=StoreSalesData['Quantity'],
            aggfunc='sum').melt()
```

**Visualizing the Total sales Quantity per month**

```python
import matplotlib.pyplot as plt
%matplotlib inline

SalesQuantity=pd.crosstab(columns=StoreSalesData['Year'],
                          index=StoreSalesData['Month'],
                          values=StoreSalesData['Quantity'],
                          aggfunc='sum').melt()['value']

MonthNames=['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']*4

# Plotting the sales
SalesQuantity.plot(kind='line', figsize=(16,5), title='Total Sales Quantity per month')

# Setting the x-axis labels
plotLabels=plt.xticks(np.arange(0,48,1), MonthNames, rotation=30)
```

Looking at the values present inside SalesQuantity variable that we plotted above

```python
SalesQuantity.values
```

Each of these numbers is a combination of the three components listed below.

- **Trend component**: This determines the trend of the time series: whether the overall trend is increasing, decreasing, or constant.
- **Seasonality component**: This determines the cyclic nature of the data, e.g. ice cream sales peak in summer and decline in winter. This cycle happens every year due to changing seasons, hence we know when to expect high sales and when to expect low sales every year.
- **Residue component**: This is also known as the error part of the time series, the part you cannot explain. Something random happens and increases the sales of an item at an unexpected time of the year, e.g. hand sanitizer sales increased when Covid-19 broke out.

```python
# Decomposing the Sales numbers in the Time Series data
from statsmodels.tsa.seasonal import seasonal_decompose

series = seasonal_decompose(SalesQuantity.values, period=12)
result = seasonal_decompose(SalesQuantity.values, period=12)

#print(result.trend)
#print(result.seasonal)
#print(result.resid)
#print(result.observed)

result.plot()
CurrentFig=plt.gcf()
CurrentFig.set_size_inches(11,8)
plt.show()
```

**Introduction to the SARIMA Time Series algorithm**

In this case study, I am using the SARIMA (Seasonal ARIMA) algorithm. It performs better than plain ARIMA on seasonal data because it models the seasonal patterns separately, hence providing more accurate predictions. In the SARIMAX() function below, you can notice two sets of order parameters: **order** and **seasonal_order**, which both take their own (p,d,q) parameters.

**What are these (p,d,q) parameters in ARIMA time series?**

ARIMA is a short form of **A**uto-**R**egressive (**p**) **I**ntegrated (**d**) **M**oving **A**verage (**q**) models.

**p**: How many Auto-Regressive (AR) components to keep in the ARIMA time series equation. e.g. ARIMA(2,0,0) means there will be two AR terms in the equation, as shown below.

**d**: The order of differencing applied to the time series. e.g. ARIMA(2,1,0) means there will be 2 AR terms on a series that is differenced at order 1, as shown below.

**q**: How many Moving Average (MA) components to keep in the ARIMA time series equation. e.g. ARIMA(2,1,3) will have 2 AR components, 1st-order differencing, and 3 MA components in the equation, as shown below.
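As a sketch in standard textbook notation (with $c$ a constant, $\phi_i$ the AR coefficients, $\theta_i$ the MA coefficients, and $\varepsilon_t$ the error term), the equations for these example models are:

```latex
% ARIMA(2,0,0): two AR terms
y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \varepsilon_t

% ARIMA(2,1,0): the same AR terms applied to the first difference y'_t = y_t - y_{t-1}
y'_t = c + \phi_1 y'_{t-1} + \phi_2 y'_{t-2} + \varepsilon_t

% ARIMA(2,1,3): two AR terms, 1st-order differencing, three MA terms
y'_t = c + \phi_1 y'_{t-1} + \phi_2 y'_{t-2}
         + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \theta_3 \varepsilon_{t-3}
         + \varepsilon_t
```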

With a SARIMA model there are more components in the equation compared to ARIMA, since the seasonal part is modelled separately. For example, the equation of SARIMA(order=(2,1,3), seasonal_order=(2,0,3)) will look like below.
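In compact backshift notation ($B\,y_t = y_{t-1}$ and $B^s y_t = y_{t-s}$), the general SARIMA$(p,d,q)(P,D,Q)_s$ model is conventionally written as:

```latex
\phi_p(B)\,\Phi_P(B^s)\,(1 - B)^d\,(1 - B^s)^D\, y_t
    = \theta_q(B)\,\Theta_Q(B^s)\,\varepsilon_t
```

Here $\phi_p$ and $\theta_q$ are the non-seasonal AR and MA polynomials, and $\Phi_P$ and $\Theta_Q$ their seasonal counterparts. For SARIMA(order=(2,1,3), seasonal_order=(2,0,3)) with a monthly cycle, that means $p=2$, $d=1$, $q=3$ for the non-seasonal part and $P=2$, $D=0$, $Q=3$ with $s=12$ for the seasonal part.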

**How to find the best values of p,d,q?**

The best values of p,d,q parameters can be found by trying out multiple values and measuring the accuracy for each combination. The combination that generates the highest accuracy is the best one. This process is also known as hyperparameter tuning!
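The accuracy measure used for this comparison throughout the case study is 100 minus the Mean Absolute Percentage Error (MAPE). A minimal sketch with hypothetical numbers (plain Python, no pandas needed):

```python
# Hypothetical actual and predicted monthly quantities
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]

# Absolute percentage error for each month
ape = [abs(a - p) / a * 100 for a, p in zip(actual, predicted)]

# MAPE is the mean of these; accuracy is 100 - MAPE
MAPE = sum(ape) / len(ape)
accuracy = 100 - MAPE
print(round(accuracy, 2))   # → 93.33
```

The hyperparameter search below simply computes this accuracy for every (p,d,q) combination and keeps the results in a table.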

Since we are using SARIMA, here we need to tune 6 parameters: (p,d,q) for order and (p,d,q) for seasonal_order.

**Function to find best p,d,q parameters**

The below function finds the best p,d,q parameters for the SARIMA model.

```python
# Creating the function to find best values of p,d,q for SARIMA
def FunctionTuneArima(inpData, p_values, d_values, q_values,
                      seasonal_p_values, seasonal_d_values, seasonal_q_values, cycle):

    # Suppressing warning messages
    import warnings
    warnings.filterwarnings(action='ignore')

    from statsmodels.tsa.statespace.sarimax import SARIMAX

    # Creating an empty data frame to store the results
    Results=pd.DataFrame()

    # Fitting the model for each set of values passed
    for p_value in p_values:
        for d_value in d_values:
            for q_value in q_values:
                for seasonal_p_value in seasonal_p_values:
                    for seasonal_d_value in seasonal_d_values:
                        for seasonal_q_value in seasonal_q_values:
                            try:
                                model = SARIMAX(inpData,
                                                order=(p_value, d_value, q_value),
                                                seasonal_order=(seasonal_p_value,
                                                                seasonal_d_value,
                                                                seasonal_q_value,
                                                                cycle))
                                model_fit=model.fit(disp=False)
                                pred = model_fit.predict(0, len(inpData))
                                Acc=100 - np.mean(abs(pred-inpData)/inpData*100)
                                Results=pd.concat([Results,
                                                   pd.DataFrame([[p_value, d_value, q_value,
                                                                  seasonal_p_value,
                                                                  seasonal_d_value,
                                                                  seasonal_q_value,
                                                                  Acc]],
                                                                columns=["p","d","q",
                                                                         "seasonal_p",
                                                                         "seasonal_d",
                                                                         "seasonal_q",
                                                                         "Accuracy"])])
                            except:
                                pass
    return(Results)
```

```python
# Calling the function to get the best values
# This can take some time because there are multiple combinations!
# Cycle=12 because this is monthly data
ResultsData=FunctionTuneArima(inpData=SalesQuantity,
                              p_values=[0,1],
                              d_values=[0,1],
                              q_values=[1,10],
                              seasonal_p_values=[1,2],
                              seasonal_d_values=[0],
                              seasonal_q_values=[0],
                              cycle=12)
```

```python
# Sorting the results to get the 10 best combinations
ResultsData.sort_values('Accuracy', ascending=False).head(10)
```

Based on the above results, we can see the best values are order (p,d,q) = (1,0,10) and seasonal_order (p,d,q) = (2,0,0). The cycle = 12 stays constant because this is monthly data.

Using these best hyperparameters, we create the time series model below.

```python
# Importing the algorithm
from statsmodels.tsa.statespace.sarimax import SARIMAX
import warnings
warnings.filterwarnings('ignore')

# Train the model on the full dataset
SarimaxModel = SARIMAX(SalesQuantity,
                       order=(1, 0, 10),
                       seasonal_order=(2, 0, 0, 12))

# Fitting the Time series model on the data
SalesModel = SarimaxModel.fit(disp=False)

# Forecast for the next 6 months
FutureMonths=6
forecast = SalesModel.predict(start=0,
                              end=len(SalesQuantity) + FutureMonths,
                              typ='levels').rename('Forecast')
print("Next Six Month Forecast:\n", forecast[-FutureMonths:])

# Plot the forecast values
SalesQuantity.plot(figsize=(18, 5), legend=True, title='Time Series Sales Forecasts')
forecast.plot(legend=True, figsize=(18, 5))

# Measuring the Training accuracy of the model
MAPE=np.mean(abs(SalesQuantity-forecast)/SalesQuantity)*100
print('#### Accuracy of model:', round(100-MAPE,2), '####')

# Printing month names in X-Axis
PlotMonthNames=MonthNames+MonthNames[0:FutureMonths]
plotLabels=plt.xticks(np.arange(0,len(PlotMonthNames),1), PlotMonthNames, rotation=30)
```

The above forecast is for the unit sales of ALL the categories in the next 6 months. The training accuracy is 79%.

**For how many future months can I do the predictions using Time Series?**

You can do predictions for any number of months, for example the next 3 months, 11 months, 24 months, etc. However, the recommendation is to go no further than 6 months. The reason is that the data changes every month; hence, ideally, the model should be retrained every month with the latest data and the future predictions revised.
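The monthly retraining idea can be sketched as a rolling loop. To keep the sketch self-contained, it uses a seasonal-naive forecaster (repeat the value from the same month last year) as a stand-in for SARIMA; in a real pipeline you would refit SARIMAX on the growing history at each step. All values and names here are hypothetical.

```python
# Stand-in "model": forecast each future month as the value observed 12 months earlier.
# Assumes horizon <= cycle. A real pipeline would call SARIMAX(history, ...).fit() here.
def seasonal_naive_forecast(history, horizon=6, cycle=12):
    return [history[-cycle + h] for h in range(horizon)]

# Hypothetical monthly quantities: two full yearly cycles of history
history = [120, 130, 150, 160, 170, 180, 200, 210, 190, 175, 160, 220,
           125, 138, 155, 165, 172, 185, 205, 215, 195, 180, 165, 230]

# Each month a new observation arrives: append it, "retrain", and re-issue the forecast
new_observations = [130, 140]
for obs in new_observations:
    history.append(obs)
    forecast = seasonal_naive_forecast(history, horizon=6)
    print(forecast)
```

The point of the loop is the cadence, not the model: every month the forecast is rebuilt from the full, updated history instead of trusting a long-horizon prediction made months ago.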

**How can I improve the predictions?**

The first logical step is to bring in more data. Especially since this is a univariate modelling technique, the more years' worth of data, the better!

If you don’t have additional data, then drilling down into categories/sub-categories may help to capture the data patterns better. For example, the time series we created above captures the sales patterns of all the categories put together. If we create one time series per category, it is likely to produce better predictions.

In the below aggregation, you can observe the data at the category level. Based on the total sales we can prioritize which category’s Time Series should be created first.

You can also apply data transformations, like the square root or log of the SalesQuantity variable, to improve the model fit and the prediction accuracy.

```python
# Aggregating the data at Category level
StoreSalesData.groupby(['Category']).sum()
```

**Time Series only for the Office Supplies category**

```python
# Filtering only Office Supplies data
OfficeSupplySalesData=StoreSalesData[StoreSalesData['Category']=='Office Supplies']
```

```python
# Aggregating the sales quantity for each month for the Office Supplies category
pd.crosstab(columns=OfficeSupplySalesData['Month'],
            index=OfficeSupplySalesData['Year'],
            values=OfficeSupplySalesData['Quantity'],
            aggfunc='sum')
```

**Visualizing the Sales data for the Office Supplies Category**

Once we have filtered the data for the Office Supplies category only, you can observe slightly different trends in the data, which makes sense because now you are focussing on the sales of one particular category.

```python
import matplotlib.pyplot as plt

SalesQuantity=pd.crosstab(columns=OfficeSupplySalesData['Year'],
                          index=OfficeSupplySalesData['Month'],
                          values=OfficeSupplySalesData['Quantity'],
                          aggfunc='sum').melt()['value']

MonthNames=['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']*4

# Plotting the sales
SalesQuantity.plot(kind='line', figsize=(16,5),
                   title='Total Sales Quantity per month for Office Supplies Category')

# Setting the x-axis labels
plotLabels=plt.xticks(np.arange(0,48,1), MonthNames, rotation=30)
```

**Taking Log of SalesQuantity**

Taking a log is one of the top data transformations for continuous variables in Machine Learning when trying to improve the model fit. Just make sure there are no zero values in the data before taking the log. Here, taking the log of the Sales Quantity column will improve the model fit as well as the accuracy.
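The zero-value check and the round trip back to the original scale can be sketched as follows (hypothetical values; plain Python's math module stands in for np.log/np.exp):

```python
import math

# Hypothetical monthly sales quantities
quantities = [120.0, 95.0, 140.0]

# log() is undefined at zero, so confirm there are no zero (or negative) values first
assert all(q > 0 for q in quantities), "log requires strictly positive values"

log_quantities = [math.log(q) for q in quantities]

# The transformation is reversible: exp() brings values back to the original scale
recovered = [round(math.exp(v), 1) for v in log_quantities]
print(recovered)   # → [120.0, 95.0, 140.0]
```

If zeros are possible in your data, np.log1p (log of 1 + x) is a common alternative that avoids the undefined point.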

```python
## Taking Log of the SalesQuantity
SalesQuantityLOG=np.log(SalesQuantity)

# Plotting the sales quantity after LOG transformation
SalesQuantityLOG.plot(kind='line', figsize=(16,5),
                      title='Total Log(SalesQuantity) per month for Office Supplies Category')

# Setting the x-axis labels
plotLabels=plt.xticks(np.arange(0,48,1), MonthNames, rotation=30)
```

Notice the y-axis here: the values are now small due to the log transformation; however, the data pattern is exactly the same as it was before taking the log.

**Finding best hyperparameters (p,d,q) for the SARIMA model**

Calling the same function that we defined above to find the best hyperparameters for the SARIMA model of the Office Supplies category.

```python
# Calling the function defined above to get the best values
# This can take some time because there are multiple combinations!
# Cycle=12 because this is monthly data
ResultsData=FunctionTuneArima(inpData=SalesQuantityLOG,
                              p_values=[0,10],
                              d_values=[0],
                              q_values=[0,5],
                              seasonal_p_values=[0,2],
                              seasonal_d_values=[0],
                              seasonal_q_values=[0,1,3],
                              cycle=12)
```

```python
# Sorting the results to get the 10 best values
ResultsData.sort_values('Accuracy', ascending=False).head(10)
```

Notice that the accuracy numbers are very high after the log transformation! However, it is important to understand that this is not the final accuracy, because these numbers were achieved on transformed data. You have to convert the data back to the original scale before computing accuracy, as shown in the below code.

**Creating the Time Series for Office Supplies Category**

Creating the time series only for the Office Supplies category using the best hyperparameters found in the above step: **order** = (10, 0, 5) and **seasonal_order** = (2, 0, 3). The cycle remains 12 as this is a monthly time series.

```python
# Importing the algorithm
from statsmodels.tsa.statespace.sarimax import SARIMAX
import warnings
warnings.filterwarnings('ignore')

# Train the model on the full dataset
SarimaxModel = SARIMAX(SalesQuantityLOG,
                       order=(10, 0, 5),
                       seasonal_order=(2, 0, 3, 12))

# Fitting the Time series model on the LOG scale data
SalesModel = SarimaxModel.fit(disp=False)

# Forecast for the next 6 months
FutureMonths=6
forecast = SalesModel.predict(start=0,
                              end=len(SalesQuantityLOG) + FutureMonths,
                              typ='levels').rename('Forecast')

# Bringing the forecast to the original scale by taking antilog
forecast=np.exp(forecast)
print("Next Six Month Forecast:\n", forecast[-FutureMonths:])

# Plot the forecast values
# Using Original scale SalesQuantity for plot
SalesQuantity.plot(figsize=(18, 5), legend=True, title='Time Series Sales Forecasts')
forecast.plot(legend=True, figsize=(18, 5))

# Measuring the Training accuracy of the model
# Using Original scale SalesQuantity for accuracy computation
MAPE=np.mean(abs(SalesQuantity-forecast)/SalesQuantity)*100
print('#### Accuracy of model:', round(100-MAPE,2), '####')

# Printing month names in X-Axis
PlotMonthNames=MonthNames+MonthNames[0:FutureMonths]
plotLabels=plt.xticks(np.arange(0,len(PlotMonthNames),1), PlotMonthNames, rotation=30)
```

**Conclusion**

I hope this post taught you Time Series Modelling from a different perspective!

Keep in mind that choosing the right algorithm for a business problem is half the battle won! In this case study, we chose time series because the business problem was to predict the Sales Quantity for the next few months, and in the given data most of the columns were qualitative, except for Sales and Quantity. Since Sales is derived from Quantity, we used Quantity as the variable to fit the time series, because it's a raw variable and preferred over a derived one. As a good exercise, you can try the same template using the Sales variable and see if the accuracy increases!

Thank you for your time on this website! If you found it useful, please share it with your friends to spread the knowledge!
