Categorical variables

Categorical variables are external factors that can influence a forecast. These variables take on one of a limited, fixed number of possible values, and induce a grouping of your observations.

For example, if you’re forecasting daily product demand for a retailer, an event variable could tell you what kind of event takes place on a given day, such as ‘None’, ‘Sporting’, or ‘Cultural’.

To incorporate categorical variables in TimeGPT, you’ll need to pair each point in your time series data with the corresponding external data.
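
As a rough illustration of what this pairing looks like, the sketch below builds a tiny long-format dataframe in which every (unique_id, ds) row carries both the target and an event category. The series name, dates, values and the `event` column name are made up for illustration; only the unique_id, ds and y column names follow the convention used later in this tutorial.

```python
import pandas as pd

# Hypothetical toy data: one series, five days, one event day.
toy_df = pd.DataFrame({
    "unique_id": ["store_1"] * 5,
    "ds": pd.date_range("2024-01-01", periods=5, freq="D"),
    "y": [10.0, 12.0, 11.0, 25.0, 13.0],
    "event": ["None", "None", "None", "Sporting", "None"],
})

# Each timestamp of each series should carry exactly one category value.
rows_per_timestamp = toy_df.groupby(["unique_id", "ds"]).size()
print(toy_df)
```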

1. Import packages

First, we import the required packages and initialize the Nixtla client.

import pandas as pd
import os

from nixtla import NixtlaClient
from datasetsforecast.m5 import M5
nixtla_client = NixtlaClient(
    # defaults to os.environ.get("NIXTLA_API_KEY")
    api_key='my_api_key_provided_by_nixtla'
)
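
To avoid hard-coding the key in a notebook, it can be read from the environment first. The helper below is a minimal sketch of that idea; `resolve_api_key` is a hypothetical name and is not part of the nixtla package.

```python
import os


def resolve_api_key(explicit_key=None, env=None):
    """Return the explicit key if given, else the NIXTLA_API_KEY variable.

    Hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    key = explicit_key or env.get("NIXTLA_API_KEY")
    if not key:
        raise ValueError("Set NIXTLA_API_KEY or pass an API key explicitly.")
    return key
```

The returned string can then be passed as the api_key argument of NixtlaClient.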

👍

Use an Azure AI endpoint

To use an Azure AI endpoint, remember to also set the base_url argument:

nixtla_client = NixtlaClient(base_url="your azure ai endpoint", api_key="your api_key")

2. Load M5 data

Let’s look at an example of predicting product sales in the M5 dataset. The M5 dataset contains daily product demand (sales) for 10 retail stores in the US.

First, we load the data using datasetsforecast. This returns:

  • Y_df, containing the sales (y column), for each unique product (unique_id column) at every timestamp (ds column).
  • X_df, containing additional relevant information for each unique product (unique_id column) at every timestamp (ds column).

Y_df, X_df, _ = M5.load(directory=os.getcwd())
Y_df['ds'] = pd.to_datetime(Y_df['ds'])
X_df['ds'] = pd.to_datetime(X_df['ds'])
Y_df.head(10)
   unique_id         ds          y
0  FOODS_1_001_CA_1  2011-01-29  3.0
1  FOODS_1_001_CA_1  2011-01-30  0.0
2  FOODS_1_001_CA_1  2011-01-31  0.0
3  FOODS_1_001_CA_1  2011-02-01  1.0
4  FOODS_1_001_CA_1  2011-02-02  4.0
5  FOODS_1_001_CA_1  2011-02-03  2.0
6  FOODS_1_001_CA_1  2011-02-04  0.0
7  FOODS_1_001_CA_1  2011-02-05  2.0
8  FOODS_1_001_CA_1  2011-02-06  0.0
9  FOODS_1_001_CA_1  2011-02-07  0.0

For this example, we keep only the column event_type_1 from the additional information. This column is a categorical variable that indicates whether an important event that might affect product sales takes place on a given date.

X_df = X_df[['unique_id', 'ds', 'event_type_1']]

X_df.head(10)
   unique_id         ds          event_type_1
0  FOODS_1_001_CA_1  2011-01-29  nan
1  FOODS_1_001_CA_1  2011-01-30  nan
2  FOODS_1_001_CA_1  2011-01-31  nan
3  FOODS_1_001_CA_1  2011-02-01  nan
4  FOODS_1_001_CA_1  2011-02-02  nan
5  FOODS_1_001_CA_1  2011-02-03  nan
6  FOODS_1_001_CA_1  2011-02-04  nan
7  FOODS_1_001_CA_1  2011-02-05  nan
8  FOODS_1_001_CA_1  2011-02-06  Sporting
9  FOODS_1_001_CA_1  2011-02-07  nan

As you can see, on February 6th 2011, there is a Sporting event.
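
Before encoding, it can help to check which categories occur and how often. The sketch below does this on a small stand-in for the event_type_1 column. The values are invented, and the string 'nan' is assumed to mark days without an event, as in the output above.

```python
import pandas as pd

# Invented stand-in for the event_type_1 column.
events = pd.Series(
    ["nan", "nan", "Sporting", "nan", "Cultural", "nan", "Sporting"],
    name="event_type_1",
)

# Frequency of each category, most common first.
category_counts = events.value_counts()

# Fraction of days on which some event takes place.
event_share = (events != "nan").mean()

print(category_counts)
print(f"Share of days with an event: {event_share:.2f}")
```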

3. Forecasting product demand using categorical variables

We will forecast the demand for a single product only. We choose a high-selling food product identified by FOODS_3_090_CA_3.

product = 'FOODS_3_090_CA_3'
Y_df_product = Y_df.query('unique_id == @product')
X_df_product = X_df.query('unique_id == @product')

We merge our two dataframes to create the dataset to be used in TimeGPT.

df = Y_df_product.merge(X_df_product)

df.head(10)
   unique_id         ds          y      event_type_1
0  FOODS_3_090_CA_3  2011-01-29  108.0  nan
1  FOODS_3_090_CA_3  2011-01-30  132.0  nan
2  FOODS_3_090_CA_3  2011-01-31  102.0  nan
3  FOODS_3_090_CA_3  2011-02-01  120.0  nan
4  FOODS_3_090_CA_3  2011-02-02  106.0  nan
5  FOODS_3_090_CA_3  2011-02-03  123.0  nan
6  FOODS_3_090_CA_3  2011-02-04  279.0  nan
7  FOODS_3_090_CA_3  2011-02-05  175.0  nan
8  FOODS_3_090_CA_3  2011-02-06  186.0  Sporting
9  FOODS_3_090_CA_3  2011-02-07  120.0  nan
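
Calling merge without arguments joins on all columns the two dataframes share, here unique_id and ds. A slightly more defensive variant, sketched below on made-up data, names the keys explicitly and asks pandas to check that each key appears only once on both sides. This is an optional safeguard, not something the tutorial requires.

```python
import pandas as pd

dates = pd.date_range("2011-01-29", periods=3, freq="D")

# Made-up target and exogenous frames sharing the same keys.
y_toy = pd.DataFrame({"unique_id": ["A"] * 3, "ds": dates, "y": [108.0, 132.0, 102.0]})
x_toy = pd.DataFrame(
    {"unique_id": ["A"] * 3, "ds": dates, "event_type_1": ["nan", "nan", "Sporting"]}
)

# validate="one_to_one" raises a MergeError if a key is duplicated on either side.
merged = y_toy.merge(x_toy, on=["unique_id", "ds"], how="inner", validate="one_to_one")
print(merged)
```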

In order to use categorical variables with TimeGPT, it is necessary to numerically encode the variables. We will use one-hot encoding in this tutorial.

We can one-hot encode the event_type_1 column with the pandas get_dummies function. After encoding, we add the resulting indicator columns to the dataframe and remove the original column.

event_type_1_ohe = pd.get_dummies(df['event_type_1'], dtype=int)
df = pd.concat([df, event_type_1_ohe], axis=1)
df = df.drop(columns = 'event_type_1')

df.tail(10)
      unique_id         ds          y      Cultural  National  Religious  Sporting  nan
1959  FOODS_3_090_CA_3  2016-06-10  140.0  0         0         0          0         1
1960  FOODS_3_090_CA_3  2016-06-11  151.0  0         0         0          0         1
1961  FOODS_3_090_CA_3  2016-06-12  87.0   0         0         0          0         1
1962  FOODS_3_090_CA_3  2016-06-13  67.0   0         0         0          0         1
1963  FOODS_3_090_CA_3  2016-06-14  50.0   0         0         0          0         1
1964  FOODS_3_090_CA_3  2016-06-15  58.0   0         0         0          0         1
1965  FOODS_3_090_CA_3  2016-06-16  116.0  0         0         0          0         1
1966  FOODS_3_090_CA_3  2016-06-17  124.0  0         0         0          0         1
1967  FOODS_3_090_CA_3  2016-06-18  167.0  0         0         0          0         1
1968  FOODS_3_090_CA_3  2016-06-19  118.0  0         0         0          1         0

As you can see, we have now added 5 columns, each a binary indicator (1 or 0) of whether there is a Cultural, National, Religious, Sporting or no (nan) event on that particular day. For example, on June 19th 2016, there is a Sporting event.
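
The encoding step is easier to see on a handful of rows. The sketch below applies get_dummies to an invented series in which days without an event are marked with the string 'nan', as in this dataset. If the column held genuine missing values instead, get_dummies would leave those rows all zero unless dummy_na=True is passed.

```python
import pandas as pd

# Invented event labels; 'nan' is a regular string category here.
events = pd.Series(["nan", "Sporting", "nan", "Cultural"], name="event_type_1")

# One indicator column per category, stored as integers.
ohe = pd.get_dummies(events, dtype=int)
print(ohe)
```

Each row contains exactly one 1, so the indicator columns together carry the same information as the original column.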

Let’s turn to our forecasting task. We will forecast the first 7 days of February 2016. This includes 7 February 2016 - the date on which Super Bowl 50 was held. Such large, national events typically impact retail product sales.

To use the encoded categorical variables in TimeGPT, we have to provide their future values over the forecast horizon. Therefore, we create a future values dataframe that contains the unique_id, the timestamp ds, and the encoded categorical variables.

Of course, we drop the target column as this is normally not available - this is the quantity that we seek to forecast!

future_ex_vars_df = df.drop(columns = ['y'])
future_ex_vars_df = future_ex_vars_df.query("ds >= '2016-02-01' & ds <= '2016-02-07'")

future_ex_vars_df.head(10)
      unique_id         ds          Cultural  National  Religious  Sporting  nan
1829  FOODS_3_090_CA_3  2016-02-01  0         0         0          0         1
1830  FOODS_3_090_CA_3  2016-02-02  0         0         0          0         1
1831  FOODS_3_090_CA_3  2016-02-03  0         0         0          0         1
1832  FOODS_3_090_CA_3  2016-02-04  0         0         0          0         1
1833  FOODS_3_090_CA_3  2016-02-05  0         0         0          0         1
1834  FOODS_3_090_CA_3  2016-02-06  0         0         0          0         1
1835  FOODS_3_090_CA_3  2016-02-07  0         0         0          1         0

Next, we limit our input dataframe to all but the 7 forecast days:

df_train = df.query("ds < '2016-02-01'")

df_train.tail(10)
      unique_id         ds          y      Cultural  National  Religious  Sporting  nan
1819  FOODS_3_090_CA_3  2016-01-22  94.0   0         0         0          0         1
1820  FOODS_3_090_CA_3  2016-01-23  144.0  0         0         0          0         1
1821  FOODS_3_090_CA_3  2016-01-24  146.0  0         0         0          0         1
1822  FOODS_3_090_CA_3  2016-01-25  87.0   0         0         0          0         1
1823  FOODS_3_090_CA_3  2016-01-26  73.0   0         0         0          0         1
1824  FOODS_3_090_CA_3  2016-01-27  62.0   0         0         0          0         1
1825  FOODS_3_090_CA_3  2016-01-28  64.0   0         0         0          0         1
1826  FOODS_3_090_CA_3  2016-01-29  102.0  0         0         0          0         1
1827  FOODS_3_090_CA_3  2016-01-30  113.0  0         0         0          0         1
1828  FOODS_3_090_CA_3  2016-01-31  98.0   0         0         0          0         1
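
Before calling the API it can be worth checking that the training and future dataframes line up. The sketch below runs three such checks on invented data: the future frame has one row per forecast step, it starts after the training data ends, and it contains the same exogenous columns. The horizon `h = 3`, the toy values and the check names are assumptions for illustration.

```python
import pandas as pd

h = 3  # hypothetical horizon for this toy example

toy = pd.DataFrame({
    "unique_id": ["A"] * 10,
    "ds": pd.date_range("2016-01-01", periods=10, freq="D"),
    "y": list(range(10)),
    "Sporting": [0] * 9 + [1],
    "nan": [1] * 9 + [0],
})

# Split: the last h days become the future frame, without the target.
cutoff = toy["ds"].max() - pd.Timedelta(days=h)  # last training day
train = toy[toy["ds"] <= cutoff]
future = toy[toy["ds"] > cutoff].drop(columns="y")

key_cols = ("unique_id", "ds")
train_exog = [c for c in train.columns if c not in key_cols + ("y",)]
future_exog = [c for c in future.columns if c not in key_cols]

checks = {
    "h_rows_per_series": bool((future.groupby("unique_id").size() == h).all()),
    "no_overlap": bool(future["ds"].min() > train["ds"].max()),
    "same_exog_columns": train_exog == future_exog,
}
print(checks)
```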

Let’s call the forecast method, first without the categorical variables.

timegpt_fcst_without_cat_vars_df = nixtla_client.forecast(df=df_train, h=7, level=[80, 90])
timegpt_fcst_without_cat_vars_df.head()
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Preprocessing dataframes...
INFO:nixtla.nixtla_client:Inferred freq: D
INFO:nixtla.nixtla_client:Restricting input...
INFO:nixtla.nixtla_client:Calling Forecast Endpoint...
   unique_id         ds          TimeGPT    TimeGPT-lo-90  TimeGPT-lo-80  TimeGPT-hi-80  TimeGPT-hi-90
0  FOODS_3_090_CA_3  2016-02-01  73.304092  53.449049      54.795078      91.813107      93.159136
1  FOODS_3_090_CA_3  2016-02-02  66.335518  47.510669      50.274136      82.396899      85.160367
2  FOODS_3_090_CA_3  2016-02-03  65.881630  36.218617      41.388896      90.374364      95.544643
3  FOODS_3_090_CA_3  2016-02-04  72.371864  -26.683115     25.097362      119.646367     171.426844
4  FOODS_3_090_CA_3  2016-02-05  95.141045  -2.084882      34.027078      156.255011     192.366971

📘

Available models in Azure AI

If you are using an Azure AI endpoint, please be sure to set model="azureai":

nixtla_client.forecast(..., model="azureai")

For the public API, we support two models: timegpt-1 and timegpt-1-long-horizon.

By default, timegpt-1 is used. Please see this tutorial on how and when to use timegpt-1-long-horizon.

We plot the forecast and the last 28 days before the forecast period:

nixtla_client.plot(
    df[['unique_id', 'ds', 'y']].query("ds <= '2016-02-07'"), 
    timegpt_fcst_without_cat_vars_df, 
    max_insample_length=28, 
)

TimeGPT already provides a reasonable forecast, but it seems to somewhat underforecast the peak on the 6th of February 2016 - the day before the Super Bowl.

Let’s call the forecast method again, now with the categorical variables.

timegpt_fcst_with_cat_vars_df = nixtla_client.forecast(df=df_train, X_df=future_ex_vars_df, h=7, level=[80, 90])
timegpt_fcst_with_cat_vars_df.head()
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Preprocessing dataframes...
INFO:nixtla.nixtla_client:Inferred freq: D
INFO:nixtla.nixtla_client:Using the following exogenous variables: Cultural, National, Religious, Sporting, nan
INFO:nixtla.nixtla_client:Calling Forecast Endpoint...
   unique_id         ds          TimeGPT    TimeGPT-lo-90  TimeGPT-lo-80  TimeGPT-hi-80  TimeGPT-hi-90
0  FOODS_3_090_CA_3  2016-02-01  70.661271  -0.204378      14.593348      126.729194     141.526919
1  FOODS_3_090_CA_3  2016-02-02  65.566941  -20.394326     11.654239      119.479643     151.528208
2  FOODS_3_090_CA_3  2016-02-03  68.510010  -33.713710     6.732952       130.287069     170.733731
3  FOODS_3_090_CA_3  2016-02-04  75.417710  -40.974649     4.751767       146.083653     191.810069
4  FOODS_3_090_CA_3  2016-02-05  97.340302  -57.385361     18.253812      176.426792     252.065965

We plot the forecast and the last 28 days before the forecast period:

nixtla_client.plot(
    df[['unique_id', 'ds', 'y']].query("ds <= '2016-02-07'"), 
    timegpt_fcst_with_cat_vars_df, 
    max_insample_length=28, 
)

We can visually verify that the forecast is closer to the actual observed value, which is the result of including the categorical variable in our forecast.

Let’s verify this conclusion by computing the Mean Absolute Error on the forecasts we created.

from utilsforecast.losses import mae
# Create target dataframe
df_target = df[['unique_id', 'ds', 'y']].query("ds >= '2016-02-01' & ds <= '2016-02-07'")

# Rename forecast columns
timegpt_fcst_without_cat_vars_df = timegpt_fcst_without_cat_vars_df.rename(columns={'TimeGPT': 'TimeGPT-without-cat-vars'})
timegpt_fcst_with_cat_vars_df = timegpt_fcst_with_cat_vars_df.rename(columns={'TimeGPT': 'TimeGPT-with-cat-vars'})

# Merge forecasts with target dataframe
df_target = df_target.merge(timegpt_fcst_without_cat_vars_df[['unique_id', 'ds', 'TimeGPT-without-cat-vars']])
df_target = df_target.merge(timegpt_fcst_with_cat_vars_df[['unique_id', 'ds', 'TimeGPT-with-cat-vars']])

# Compute errors
mean_absolute_errors = mae(df_target, ['TimeGPT-without-cat-vars', 'TimeGPT-with-cat-vars'])
mean_absolute_errors
   unique_id         TimeGPT-without-cat-vars  TimeGPT-with-cat-vars
0  FOODS_3_090_CA_3  24.285649                 20.028514
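
The output above has one row per series and one column per model. Conceptually, each cell is the mean of the absolute differences between y and that model's forecast within a series. The sketch below reproduces this with plain pandas on invented numbers; it illustrates the metric and is not the utilsforecast implementation.

```python
import pandas as pd

# Invented actuals and forecasts for two series and two models.
toy = pd.DataFrame({
    "unique_id": ["A", "A", "B", "B"],
    "y": [10.0, 20.0, 5.0, 7.0],
    "model_1": [12.0, 17.0, 5.0, 10.0],
    "model_2": [10.0, 19.0, 6.0, 7.0],
})
models = ["model_1", "model_2"]

# Absolute error per row and model, then averaged within each series.
abs_errors = toy[models].sub(toy["y"], axis=0).abs()
manual_mae = abs_errors.groupby(toy["unique_id"]).mean()
print(manual_mae)
```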

Indeed, we find that the error when using TimeGPT with the categorical variable is roughly 18% lower than without it (a MAE of 20.03 versus 24.29), indicating better performance on this forecast when we include the categorical variable.
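
The relative improvement can be computed directly from the two MAE values in the table above. The variable names below are chosen for illustration.

```python
# MAE values copied from the results table above.
mae_without = 24.285649
mae_with = 20.028514

# Relative reduction in error from adding the categorical variable.
relative_reduction = (mae_without - mae_with) / mae_without
print(f"{relative_reduction:.1%}")  # prints 17.5%
```

Note that this figure comes from a single product and a single 7-day window, so it should be read as an example rather than a general estimate of the benefit.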