Time Series Forecasting: A Case Study in Python for Superstore Dataset

The Time Series is one of the old gems of Machine Learning. This is a univariate technique that becomes applicable when you have only one column of data available.

If there are multiple columns available then the preferred approach is to use Supervised ML algorithms like Random Forest, XGboost etc. Because more predictor columns will bring more information and hence the predictions will be better.

However, ideal data for supervised ML is not available most of the time and this is when Time Series shines! With just one column of historical data on sales, demand, or any value you want to predict, you can generate reliable predictions.

Take a look at the below data. It will help to visualise the above concept.

Installing the xlrd library to read data

xlrd library is required if you are reading xls files using pandas.

!pip install xlrd

1	!pip install xlrd

Reading the Data

You can download the data for this case study here.

import pandas as pd
import numpy as np

# Suppressing scientific notation
np.set_printoptions(suppress=True)

StoreSalesData=pd.read_excel('/Users/farukh/Downloads/Super Store Sales data.xls')
StoreSalesData.head()

import pandas as pd

import numpy as np

# Suppressing scientific notation

np.set_printoptions(suppress=True)

StoreSalesData=pd.read_excel('/Users/farukh/Downloads/Super Store Sales data.xls')

StoreSalesData.head()

Looking at the data summary

StoreSalesData.info()

1	StoreSalesData.info()

Problem Statement

The Superstore management contacted you to look into their sales data and create a predictive model that can tell the sales quantity expected in the next month for each of the categories of products ie. Furniture, Office Supplies and Technology.

Observing the Quantitative, Qualitative and Categorical variables in data

StoreSalesData.nunique()

1	StoreSalesData.nunique()

In this data, two variables can qualify for time series forecasting. Sales and Quantity.

Sales is the price value and can be derived using Quantity multiplied by the costs associated with products. Hence, Quantity is the raw or base variable and should be selected for time series modelling. There is no harm in selecting Sales as well, however, if you have a choice to model one variable out of the two, I will recommend Quantity as it is not a derived variable.

Now this dataset is not in the format that we can feed directly to time series. Some pre-processing will be required. The time series algorithm expects ONE column of data that we want to forecast further. Since the problem statement says to generate monthly predictions, we will aggregate the Quantity numbers monthly.

Feature Engineering in data

This data does not contain any Month or Year field, but it contains ‘Order date’. We will use that and generate Year and Month columns which can be used for our aggregation.

# Function to get month from a date
def Function_get_month(inpDate):
    return(inpDate.month)

# Function to get Year from a date
def Function_get_year(inpDate):
    return(inpDate.year)


# Creating new columns
StoreSalesData['Month']=StoreSalesData['Order Date'].apply(Function_get_month)
StoreSalesData['Year']=StoreSalesData['Order Date'].apply(Function_get_year)

StoreSalesData.head()

# Function to get month from a date

def Function_get_month(inpDate):

return(inpDate.month)

# Function to get Year from a date

def Function_get_year(inpDate):

return(inpDate.year)

# Creating new columns

StoreSalesData['Month']=StoreSalesData['Order Date'].apply(Function_get_month)

StoreSalesData['Year']=StoreSalesData['Order Date'].apply(Function_get_year)

StoreSalesData.head()

Observing the unique Months and years of data present. For time series modelling, at least 2-cycles of data should be present. In this case, we want to predict monthly sales, hence one year will be one complete cycle because all the months from Jan to Dec have appeared in the data and now the months are repeating. So, ideally, at least 2-year worth of data will be needed.

Looking at the below output, we can see we have 4 years of data present, hence, it will be sufficient for the Modelling.

# Checking unique values in Year and Month Columns
print("Unique Values in Year Column: ", StoreSalesData['Year'].sort_values().unique())
print("Unique Values in Month Column: ", StoreSalesData['Month'].sort_values().unique())

# Checking unique values in Year and Month Columns

print("Unique Values in Year Column: ", StoreSalesData['Year'].sort_values().unique())

print("Unique Values in Month Column: ", StoreSalesData['Month'].sort_values().unique())

Aggregating Sales Quantity for each month

Preparing the data for time series modelling. Since the problem statement is to do the prediction monthly, you need to aggregate the data month wise.

# Aggregating the sales quantity for each month for all categories
pd.crosstab(columns=StoreSalesData['Month'],
            index=StoreSalesData['Year'],
            values=StoreSalesData['Quantity'],x
            aggfunc='sum')

# Aggregating the sales quantity for each month for all categories

pd.crosstab(columns=StoreSalesData['Month'],

index=StoreSalesData['Year'],

values=StoreSalesData['Quantity'],x

aggfunc='sum')

The above data is in a matrix format and Time Series expect data in one single column. So by using melt() function we can arrange the data year-month wise in a single column.

# Converting the crosstab data into one single column for Time Series
pd.crosstab(columns=StoreSalesData['Year'],
            index=StoreSalesData['Month'],
            values=StoreSalesData['Quantity'],
            aggfunc='sum').melt()

# Converting the crosstab data into one single column for Time Series

pd.crosstab(columns=StoreSalesData['Year'],

index=StoreSalesData['Month'],

values=StoreSalesData['Quantity'],

aggfunc='sum').melt()

Visualizing the Total sales Quantity per month

import matplotlib.pyplot as plt
SalesQuantity=pd.crosstab(columns=StoreSalesData['Year'],
            index=StoreSalesData['Month'],
            values=StoreSalesData['Quantity'],
            aggfunc='sum').melt()['value']

MonthNames=['Jan','Feb','Mar','Apr','May', 'Jun', 'Jul', 'Aug', 'Sep','Oct','Nov','Dec']*4

# Plotting the sales
%matplotlib inline
SalesQuantity.plot(kind='line', figsize=(16,5), title='Total Sales Quantity per month')
# Setting the x-axis labels
plotLabels=plt.xticks(np.arange(0,48,1),MonthNames, rotation=30)

import matplotlib.pyplot as plt

SalesQuantity=pd.crosstab(columns=StoreSalesData['Year'],

index=StoreSalesData['Month'],

values=StoreSalesData['Quantity'],

aggfunc='sum').melt()['value']

MonthNames=['Jan','Feb','Mar','Apr','May', 'Jun', 'Jul', 'Aug', 'Sep','Oct','Nov','Dec']*4

# Plotting the sales

%matplotlib inline

SalesQuantity.plot(kind='line', figsize=(16,5), title='Total Sales Quantity per month')

# Setting the x-axis labels

plotLabels=plt.xticks(np.arange(0,48,1),MonthNames, rotation=30)

Looking at the values present inside SalesQuantity variable that we plotted above

SalesQuantity.values

1	SalesQuantity.values

Each of these numbers is a combination of three factors listed below

Trend component: This determines the trend of the time series. Whether the overall trend is increasing or decreasing or constant.

Seasonality component: This determines the cyclic nature of the data. e.g Ice cream sales will peak in summer and decline in winter. This cycle happens every year due to changing seasons, hence we know when to expect high sales and when to expect low sales every year.

Residue component: This is also known as the error part of the time series that you cannot explain. Something random happens and increases the sales of an item at an unexpected time of the year. e.g Hand Sanitizer sales increased when Covid-19 broke.

# Decomposing the Sales numbers in the Time Series data
from statsmodels.tsa.seasonal import seasonal_decompose
series = SalesQuantity.values
result = seasonal_decompose(series, period=12)
#print(result.trend)
#print(result.seasonal)
#print(result.resid)
#print(result.observed)
result.plot()
CurrentFig=plt.gcf()
CurrentFig.set_size_inches(11,8)
plt.show()

# Decomposing the Sales numbers in the Time Series data

from statsmodels.tsa.seasonal import seasonal_decompose

series = SalesQuantity.values

result = seasonal_decompose(series, period=12)

#print(result.trend)

#print(result.seasonal)

#print(result.resid)

#print(result.observed)

result.plot()

CurrentFig=plt.gcf()

CurrentFig.set_size_inches(11,8)

plt.show()

Introduction to the SARIMA Time Series algorithm

In this case study I am using the seasonal SARIMA (Seasonal ARIMA). It performs better than ARIMA because it models the seasonal patterns separately hence providing more accurate predictions. In the SARIMAX() function below, you can notice two sets of order parameters. order and seasonal_order both take their own (p,d,q) parameters.

What are these (p,d,q) parameters in ARIMA time series?

ARIMA is a short form of Auto-Regressive (p) Integrated(d) Moving Average(q) models.

p: How many Auto Regressive(AR) components to keep in the ARIMA Time Series equation. eg. ARIMA(2,0,0) means there will be two AR terms in the equation as shown below.

d: What is the order of differencing in the time series? e.g ARIMA(2,1,0) means there will be 2 AR terms that are differenced at order-1 in the Time Series equation as shown below.

q: How many Moving Average(MA) components should be kept in the ARIMA Time Series equation? e.g ARIMA(2,1,3) will have 2 AR components,1st order differencing and 3 MA components in the equation as shown in the below diagram.

When we have the SARIMA model then there are more components in the equation as compared to ARIMA which handles the seasonal part separately. For example, the equation of SARIMA(order=(2,1,3), seasonal_order=(2,0,3)) will look like below.

How to find the best values of p,d,q?

The best values of p,d,q parameters can be found by trying out multiple values and measuring the accuracy for each combination. The combination that generates the highest accuracy is the best one. This process is also known as hyperparameter tuning!

Since we are using SARIMA, here we need to tune 6 parameters! (p,d,q) for order and (p,d,q) for seasonal_order

Function to find best p,d,q parameters

The below function finds the best p,d,q parameters for the SARIMA model.

# Creating the function to find best values of p,d,q for SARIMA
def FunctionTuneArima(inpData, p_values, d_values, q_values, 
                      seasonal_p_values, seasonal_d_values, seasonal_q_values,cycle):
    # Supressing warning messages
    import warnings
    warnings.filterwarnings(action='ignore')
    
    from statsmodels.tsa.statespace.sarimax import SARIMAX
    # Fitting the model for each set of values passed
    
    # Creating an empty data frame to store
    Results=pd.DataFrame()
    
    # Trying the values
    for p_value in p_values:
        for d_value in d_values:
            for q_value in q_values:
                for seasonal_p_value in seasonal_p_values:
                    for seasonal_d_value in seasonal_d_values:
                        for seasonal_q_value in seasonal_q_values:
                            
                            try:
                                model = SARIMAX(inpData, 
                                    order=(p_value,d_value,q_value), 
                                    seasonal_order =(seasonal_p_value, 
                                                     seasonal_d_value,
                                                     seasonal_q_value,
                                                     cycle))
                                model_fit=model.fit(disp=False)
                                pred = model_fit.predict(0, len(inpData))
                                Acc=100- np.mean(abs(pred-inpData)/inpData*100)
                                
                                Results=pd.concat([Results,pd.DataFrame([[p_value,
                                                                   d_value,
                                                                   q_value,
                                                                   seasonal_p_value,
                                                                   seasonal_d_value,
                                                                   seasonal_q_value,
                                                                   Acc]],
                                                                   columns=["p","d","q",
                                                                           "seasonal_p",
                                                                           "seasonal_d",
                                                                           "seasonal_q",
                                                                           "Accuracy"
                                                                           ] )])

                            except:
                                pass
    return(Results)

# Creating the function to find best values of p,d,q for SARIMA

def FunctionTuneArima(inpData, p_values, d_values, q_values,

seasonal_p_values, seasonal_d_values, seasonal_q_values,cycle):

# Supressing warning messages

import warnings

warnings.filterwarnings(action='ignore')

from statsmodels.tsa.statespace.sarimax import SARIMAX

# Fitting the model for each set of values passed

# Creating an empty data frame to store

Results=pd.DataFrame()

# Trying the values

for p_value in p_values:

for d_value in d_values:

for q_value in q_values:

for seasonal_p_value in seasonal_p_values:

for seasonal_d_value in seasonal_d_values:

for seasonal_q_value in seasonal_q_values:

try:

model = SARIMAX(inpData,

order=(p_value,d_value,q_value),

seasonal_order =(seasonal_p_value,

seasonal_d_value,

seasonal_q_value,

cycle))

model_fit=model.fit(disp=False)

pred = model_fit.predict(0, len(inpData))

Acc=100- np.mean(abs(pred-inpData)/inpData*100)

Results=pd.concat([Results,pd.DataFrame([[p_value,

d_value,

q_value,

seasonal_p_value,

seasonal_d_value,

seasonal_q_value,

Acc]],

columns=["p","d","q",

"seasonal_p",

"seasonal_d",

"seasonal_q",

"Accuracy"

] )])

except:

pass

return(Results)

# Calling the function to get the best values
# This can take some time because there are multiple combinations!
# Cycle=12 because this is monthly data
ResultsData=FunctionTuneArima(inpData=SalesQuantity,
                  p_values=[0,1], 
                  d_values=[0,1], 
                  q_values=[1,10], 
                  seasonal_p_values=[1,2],                                       
                  seasonal_d_values=[0], 
                  seasonal_q_values=[0],
                  cycle=12
                 )

# Calling the function to get the best values

# This can take some time because there are multiple combinations!

# Cycle=12 because this is monthly data

ResultsData=FunctionTuneArima(inpData=SalesQuantity,

p_values=[0,1],

d_values=[0,1],

q_values=[1,10],

seasonal_p_values=[1,2],

seasonal_d_values=[0],

seasonal_q_values=[0],

cycle=12

)

# Sorting the results to get the 10 best combinations
ResultsData.sort_values('Accuracy', ascending=False).head(10)

1 2	# Sorting the results to get the 10 best combinations ResultsData.sort_values('Accuracy', ascending=False).head(10)

SARIMA Time Series hyperparameters tuning results

Based on the above result we can see the best values for order(p,d,q)= (1,0,10) and for seasonal_order(p,d,q) = (2,0,0). The Cycle=12 is constant because this is monthly data.

Using these best hyperparameters and creating the Time Series model below.

# Importing the algorithm
from statsmodels.tsa.statespace.sarimax import SARIMAX
import warnings
warnings.filterwarnings('ignore')


# Train the model on the full dataset 
SarimaxModel = SARIMAX(SalesQuantity,  
                        order = (1, 0, 10),  
                        seasonal_order =(2, 0, 0, 12))

 # Fitting the Time series model on the data
SalesModel = SarimaxModel.fit(disp=False)
  
# Forecast for the next 6 months
FutureMonths=6
forecast = SalesModel.predict(start = 0,
                          end = (len(SalesQuantity)) + FutureMonths,
                          typ = 'levels').rename('Forecast')
print("Next Six Month Forecast:\n",forecast[-FutureMonths:])

# Plot the forecast values
SalesQuantity.plot(figsize = (18, 5), legend = True, title='Time Series Sales Forecasts')
forecast.plot(legend = True, figsize=(18,5))

# Measuring the Training accuracy of the model
MAPE=np.mean(abs(SalesQuantity-forecast)/SalesQuantity)*100
print('#### Accuracy of model:', round(100-MAPE,2), '####')

# Printing month names in X-Axis
PlotMonthNames=MonthNames+MonthNames[0:FutureMonths]
plotLabels=plt.xticks(np.arange(0,len(PlotMonthNames),1),PlotMonthNames, rotation=30)

# Importing the algorithm

from statsmodels.tsa.statespace.sarimax import SARIMAX

import warnings

warnings.filterwarnings('ignore')

# Train the model on the full dataset

SarimaxModel = SARIMAX(SalesQuantity,

order = (1, 0, 10),

seasonal_order =(2, 0, 0, 12))

# Fitting the Time series model on the data

SalesModel = SarimaxModel.fit(disp=False)

# Forecast for the next 6 months

FutureMonths=6

forecast = SalesModel.predict(start = 0,

end = (len(SalesQuantity)) + FutureMonths,

typ = 'levels').rename('Forecast')

print("Next Six Month Forecast:\n",forecast[-FutureMonths:])

# Plot the forecast values

SalesQuantity.plot(figsize = (18, 5), legend = True, title='Time Series Sales Forecasts')

forecast.plot(legend = True, figsize=(18,5))

# Measuring the Training accuracy of the model

MAPE=np.mean(abs(SalesQuantity-forecast)/SalesQuantity)*100

print('#### Accuracy of model:', round(100-MAPE,2), '####')

# Printing month names in X-Axis

PlotMonthNames=MonthNames+MonthNames[0:FutureMonths]

plotLabels=plt.xticks(np.arange(0,len(PlotMonthNames),1),PlotMonthNames, rotation=30)

The above forecast is for the unit sales of ALL the categories in the next 6 months. The training accuracy is 79%.

For how many future months can I do the predictions using Time Series?

You can do the predictions for any number of months for example the next 3 months, 11 months, 24 months etc. However, the recommendation is not more than 6-months. The reason is that we will have many data changes every month, hence ideally the model should be retrained every month with the latest data and then the future predictions should be revised.

How can I improve the predictions?

The first logical step is to bring more data. Especially since this is a univariate modelling technique, the more number of years worth of data the better it is!

If you don’t have additional data, then drilling down into categories/sub-categories may help to capture the data patterns better. For example, we have created a Time series above that captures the sales patterns for all the categories put together. If we create one Time series for one category then it will do better predictions.

In the below aggregation, you can observe the data at the category level. Based on the total sales we can prioritize which category’s Time Series should be created first.

You can also do data transformation like Square root, log of the SalesQuantity variable to improve the model fit accuracy and the prediction accuracy.

# Aggregating the data at Category level
StoreSalesData.groupby(['Category']).sum()

1 2	# Aggregating the data at Category level StoreSalesData.groupby(['Category']).sum()

Time Series category wise data agggregation

Time Series only for the Office Supplies category

# Filtering only Office Supplies data
OfficeSupplySalesData=StoreSalesData[StoreSalesData['Category']=='Office Supplies']

1 2	# Filtering only Office Supplies data OfficeSupplySalesData=StoreSalesData[StoreSalesData['Category']=='Office Supplies']

# Aggregating the sales quantity for each month for all categories
pd.crosstab(columns=OfficeSupplySalesData['Month'],
            index=OfficeSupplySalesData['Year'],
            values=OfficeSupplySalesData['Quantity'],
            aggfunc='sum')

# Aggregating the sales quantity for each month for all categories

pd.crosstab(columns=OfficeSupplySalesData['Month'],

index=OfficeSupplySalesData['Year'],

values=OfficeSupplySalesData['Quantity'],

aggfunc='sum')

Visualizing the Sales data for the Office Supplies Category

Once we have filtered the data only for the Office Supplies category, you can observe slightly different trends in the data, and that makes sense because now you are focussing on the sales of one particular category.

import matplotlib.pyplot as plt
SalesQuantity=pd.crosstab(columns=OfficeSupplySalesData['Year'],
            index=OfficeSupplySalesData['Month'],
            values=OfficeSupplySalesData['Quantity'],
            aggfunc='sum').melt()['value']

MonthNames=['Jan','Feb','Mar','Apr','May', 'Jun', 'Jul', 'Aug', 'Sep','Oct','Nov','Dec']*4

# Plotting the sales
SalesQuantity.plot(kind='line', figsize=(16,5), title='Total Sales Quantity per month for Office Supplies Category')
# Setting the x-axis labels
plotLabels=plt.xticks(np.arange(0,48,1),MonthNames, rotation=30)

import matplotlib.pyplot as plt

SalesQuantity=pd.crosstab(columns=OfficeSupplySalesData['Year'],

index=OfficeSupplySalesData['Month'],

values=OfficeSupplySalesData['Quantity'],

aggfunc='sum').melt()['value']

MonthNames=['Jan','Feb','Mar','Apr','May', 'Jun', 'Jul', 'Aug', 'Sep','Oct','Nov','Dec']*4

# Plotting the sales

SalesQuantity.plot(kind='line', figsize=(16,5), title='Total Sales Quantity per month for Office Supplies Category')

# Setting the x-axis labels

plotLabels=plt.xticks(np.arange(0,48,1),MonthNames, rotation=30)

Time Series data for Office supplies Category

Taking Log of SalesQuantity

Taking a log is one of the top data transformations for continuous variables in Machine Learning while trying to improve the model fit, just make sure to check there are no zero values in the data before taking LOG. Here, taking the log of the Sales Quantity column will improve the model fit as well as the accuracy.

## Taking Log of the SalesQuantity
SalesQuantityLOG=np.log(SalesQuantity)

# Plotting the sales quantity after LOG transformation
SalesQuantityLOG.plot(kind='line', figsize=(16,5), title='Total Log(SalesQuantity) per month for Office Supplies Category')
# Setting the x-axis labels
plotLabels=plt.xticks(np.arange(0,48,1),MonthNames, rotation=30)

## Taking Log of the SalesQuantity

SalesQuantityLOG=np.log(SalesQuantity)

# Plotting the sales quantity after LOG transformation

SalesQuantityLOG.plot(kind='line', figsize=(16,5), title='Total Log(SalesQuantity) per month for Office Supplies Category')

# Setting the x-axis labels

plotLabels=plt.xticks(np.arange(0,48,1),MonthNames, rotation=30)

Notice the y-axis here, the values are now small due to the log transformation, however, the data pattern is exactly the same as it was before taking the log.

Finding best hyperparameters(p,d,q) for SARIMA model

Calling the same function that we defined above to find the best hyperparameters for SARIMA model of the Office supplies category.

# Calling the function defined above to get the best values
# This can take some time because there are multiple combinations!
# Cycle=12 because this is monthly data
ResultsData=FunctionTuneArima(inpData=SalesQuantityLOG,
                  p_values=[0,10], 
                  d_values=[0], 
                  q_values=[0,5],  
                  seasonal_p_values=[0,2],                                       
                  seasonal_d_values=[0], 
                  seasonal_q_values=[0,1,3],
                  cycle=12
                 )

# Calling the function defined above to get the best values

# This can take some time because there are multiple combinations!

# Cycle=12 because this is monthly data

ResultsData=FunctionTuneArima(inpData=SalesQuantityLOG,

p_values=[0,10],

d_values=[0],

q_values=[0,5],

seasonal_p_values=[0,2],

seasonal_d_values=[0],

seasonal_q_values=[0,1,3],

cycle=12

)

# Sorting the results to get the 10 best values
ResultsData.sort_values('Accuracy', ascending=False).head(10)

1 2	# Sorting the results to get the 10 best values ResultsData.sort_values('Accuracy', ascending=False).head(10)

Notice that the accuracy numbers are very high after the log transformation! However, it is important to understand this is not the final accuracy because these numbers were achieved on a transformed data. You will have to convert the data to original scale before computing accuracy as shown in the below code.

Creating the Time Series for Office Supplies Category

Creating Time Series only for the Office supplies Category using the best hyperparameters found in the above step as order = (10, 0, 5) and
seasonal_order =(2, 0, 3)). The cycle remains 12 as this is a monthly Time Series.

# Importing the algorithm
from statsmodels.tsa.statespace.sarimax import SARIMAX
import warnings
warnings.filterwarnings('ignore')

# Train the model on the full dataset 
SarimaxModel = model = SARIMAX(SalesQuantityLOG, 
                        order = (10, 0, 5),  
                        seasonal_order =(2, 0, 3, 12))

# Fitting the Time series model on the LOG scale data
SalesModel = SarimaxModel.fit(disp=False)
  
# Forecast for the next 6 months
FutureMonths=6
forecast = SalesModel.predict(start = 0,
                          end = (len(SalesQuantityLOG)) + FutureMonths,
                          typ = 'levels').rename('Forecast')

# Bringing the forecast to the original scale by taking antilog
forecast=np.exp(forecast)

print("Next Six Month Forecast:\n",forecast[-FutureMonths:])

# Plot the forecast values
# Using Original scale SalesQuantity for plot
SalesQuantity.plot(figsize = (18, 5), legend = True, title='Time Series Sales Forecasts')
forecast.plot(legend = True, figsize=(18,5))

# Measuring the Training accuracy of the model
# Using Original scale SalesQuantity for accuracy computation
MAPE=np.mean(abs(SalesQuantity-forecast)/SalesQuantity)*100
print('#### Accuracy of model:', round(100-MAPE,2), '####')

# Printing month names in X-Axis
PlotMonthNames=MonthNames+MonthNames[0:FutureMonths]
plotLabels=plt.xticks(np.arange(0,len(PlotMonthNames),1),PlotMonthNames, rotation=30)

# Importing the algorithm

from statsmodels.tsa.statespace.sarimax import SARIMAX

import warnings

warnings.filterwarnings('ignore')

# Train the model on the full dataset

SarimaxModel = model = SARIMAX(SalesQuantityLOG,

order = (10, 0, 5),

seasonal_order =(2, 0, 3, 12))

# Fitting the Time series model on the LOG scale data

SalesModel = SarimaxModel.fit(disp=False)

# Forecast for the next 6 months

FutureMonths=6

forecast = SalesModel.predict(start = 0,

end = (len(SalesQuantityLOG)) + FutureMonths,

typ = 'levels').rename('Forecast')

# Bringing the forecast to the original scale by taking antilog

forecast=np.exp(forecast)

print("Next Six Month Forecast:\n",forecast[-FutureMonths:])

# Plot the forecast values

# Using Original scale SalesQuantity for plot

SalesQuantity.plot(figsize = (18, 5), legend = True, title='Time Series Sales Forecasts')

forecast.plot(legend = True, figsize=(18,5))

# Measuring the Training accuracy of the model

# Using Original scale SalesQuantity for accuracy computation

MAPE=np.mean(abs(SalesQuantity-forecast)/SalesQuantity)*100

print('#### Accuracy of model:', round(100-MAPE,2), '####')

# Printing month names in X-Axis

PlotMonthNames=MonthNames+MonthNames[0:FutureMonths]

plotLabels=plt.xticks(np.arange(0,len(PlotMonthNames),1),PlotMonthNames, rotation=30)

Time Series forecast only for Office Supplies category

Conclusion

I hope this post taught you Time Series Modelling from a different perspective!

Keep in mind that choosing the right algorithm for a business problem is half the battle won! In this case study, we chose Time Series because we had the business problem of predicting the Sales Quantity for the next few months and in the given data most of the columns were Qualitative except for Sales and Quantity. Since Sales is derived using Quantity, hence we used Quantity as the variable to fit Time Series because its a RAW variable and its preferred over a derived variable. As a good exercise, you can try the same template using the Sales variable and see if the accuracy increases!

Thank you for your time on this website! If you found it useful, please share it with your friends to spread the knowledge!

Author Details

Farukh Hashmi

Lead Data Scientist

Farukh is an innovator in solving industry problems using Artificial intelligence. His expertise is backed with 10 years of industry experience. Being a senior data scientist he is responsible for designing the AI/ML solution to provide maximum gains for the clients. As a thought leader, his focus is on solving the key business problems of the CPG Industry. He has worked across different domains like Telecom, Insurance, and Logistics. He has worked with global tech leaders including Infosys, IBM, and Persistent systems. His passion to teach inspired him to create this website!

https://thinkingneuron.com/

thinkingneuron@gmail.com

3 thoughts on “Time Series Forecasting: A Case Study in Python for Superstore Dataset”

Punam Saha
May 21, 2024 at 7:11 pm

Hello Faruk,
Can you please upload a video on Time series? t would really help to understand Time Series more accurately from your explanation.

1. Farukh Hashmi
  May 22, 2024 at 8:32 am
  
  Hi Punam,
  
  Sure, I will make this video soon.
  Thank you for the kind words!
  
Amol Rathod
July 13, 2024 at 8:53 am

Hi Sir,
I need help with interview preparation.

Installing the xlrd library to read data

Reading the Data

Looking at the data summary

Problem Statement

Observing the Quantitative, Qualitative and Categorical variables in data

Feature Engineering in data

Aggregating Sales Quantity for each month

Visualizing the Total sales Quantity per month

Introduction to the SARIMA Time Series algorithm

What are these (p,d,q) parameters in ARIMA time series?

How to find the best values of p,d,q?

Function to find best p,d,q parameters

For how many future months can I do the predictions using Time Series?

How can I improve the predictions?

Time Series only for the Office Supplies category

Visualizing the Sales data for the Office Supplies Category

Taking Log of SalesQuantity

Finding best hyperparameters(p,d,q) for SARIMA model

Creating the Time Series for Office Supplies Category

Conclusion

3 thoughts on “Time Series Forecasting: A Case Study in Python for Superstore Dataset”

Leave a Reply! Cancel Reply