Missing Values

TimeGPT requires time series data that doesn’t have any missing values. It is possible to have multiple series that begin and end on different dates, but it is essential that each series contains uninterrupted data for its given time frame.

In this tutorial, we will show you how to deal with missing values in TimeGPT.

Outline

  1. Load Data

  2. Get Started with TimeGPT

  3. Visualize Data

  4. Fill Missing Values

  5. Forecast with TimeGPT

  6. Important Considerations

  7. References

This work is based on skforecast’s Forecasting Time Series with Missing Values tutorial.

Load Data

We will first load the data using pandas. This dataset represents the daily number of bike rentals in a city. The column names are in Spanish, so we will rename them to ds for the dates and y for the number of bike rentals.

import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/JoaquinAmatRodrigo/Estadistica-machine-learning-python/master/data/usuarios_diarios_bicimad.csv')
df = df[['fecha', 'Usos bicis total día']] # select date and target variable 
df.rename(columns={'fecha': 'ds', 'Usos bicis total día': 'y'}, inplace=True) 
df.head()
dsy
02014-06-2399
12014-06-2472
22014-06-25119
32014-06-26135
42014-06-27149

For convenience, we will convert the dates to timestamps and assign a unique id to the series. Although we only have one series in this example, when dealing with multiple series, it is necessary to assign a unique id to each one.

df['ds'] = pd.to_datetime(df['ds']) 
df['unique_id'] = 'id1'
df = df[['unique_id', 'ds', 'y']]

Now we will separate the data in a training and a test set. We will use the last 93 days as the test set.

train_df = df[:-93] 
test_df = df[-93:] 

We will now introduce some missing values in the training set to demonstrate how to deal with them. This will be done as in the skforecast tutorial.

mask = ~((train_df['ds'] >= '2020-09-01') & (train_df['ds'] <= '2020-10-10')) &  ~((train_df['ds'] >= '2020-11-08') & (train_df['ds'] <= '2020-12-15'))

train_df_gaps = train_df[mask]

Get Started with TimeGPT

Before proceeding, we will instantiate the NixtlaClient class, which provides access to all the methods from TimeGPT. To do this, you will need a Nixtla API key.

from nixtla import NixtlaClient
nixtla_client = NixtlaClient(
    # defaults to os.environ.get("NIXTLA_API_KEY")
    api_key = 'my_api_key_provided_by_nixtla'
)

👍

Use an Azure AI endpoint

To use an Azure AI endpoint, set the base_url argument:

nixtla_client = NixtlaClient(base_url="you azure ai endpoint", api_key="your api_key")

To learn more about how to set up your API key, please refer to the Setting Up Your API Key tutorial.

Visualize Data

We can visualize the data using the plot method from the NixtlaClient class. This method has an engine argument that allows you to choose between different plotting libraries. Default is matplotlib, but you can also use plotly for interactive plots.

nixtla_client.plot(train_df_gaps)

Note that there are two gaps in the data: from September 1, 2020, to October 10, 2020, and from November 8, 2020, to December 15, 2020. To better visualize these gaps, you can use the max_insample_length argument of the plot method or you can simply zoom in on the plot.

nixtla_client.plot(train_df_gaps, max_insample_length=800)

Additionally, notice a period from March 16, 2020, to April 21, 2020, where the data shows zero rentals. These are not missing values, but actual zeros corresponding to the COVID-19 lockdown in the city.

Fill Missing Values

Before using TimeGPT, we need to ensure that:

  1. All timestamps from the start date to the end date are present in the data.

  2. The target column contains no missing values.

To address the first issue, we will use the fill_gaps function from utilsforecast, a Python package from Nixtla that provides essential utilities for time series forecasting, such as functions for data preprocessing, plotting, and evaluation.

The fill_gaps function will fill in the missing dates in the data. To do this, it requires the following arguments:

  • df: The DataFrame containing the time series data.

  • freq (str or int): The frequency of the data.

from utilsforecast.preprocessing import fill_gaps
print('Number of rows before filling gaps:', len(train_df_gaps))
train_df_complete = fill_gaps(train_df_gaps, freq='D')
print('Number of rows after filling gaps:', len(train_df_complete))
Number of rows before filling gaps: 2851
Number of rows after filling gaps: 2929

NOTE:

In this tutorial, the data contains only one time series. However, TimeGPT supports passing multiple series to the model. In this case, none of the time series can have missing values from their individual earliest timestamp until their individual lastest timestamp. If these individual time series have missing values, the user must decide how to fill these gaps for the individual time series. The fill_gaps function provides a couple of additional arguments to assist with this (refer to the documentation for complete details), namely start and end

show_doc(fill_gaps)

source

fill_gaps

fill_gaps (df:~DFType, freq:Union[str,int],
start:Union[str,int,datetime.date,datetime.datetime]='per_seri
e',
end:Union[str,int,datetime.date,datetime.datetime]='global',
id_col:str='unique_id', time_col:str='ds')

Enforce start and end datetimes for dataframe.

TypeDefaultDetails
dfDFTypeInput data
freqUnionSeries’ frequency
startUnionper_serieInitial timestamp for the series.
* ‘per_serie’ uses each serie’s first timestamp
* ‘global’ uses the first timestamp seen in the data
* Can also be a specific timestamp or integer, e.g. ‘2000-01-01’, 2000 or datetime(2000, 1, 1)
endUnionglobalInitial timestamp for the series.
* ‘per_serie’ uses each serie’s last timestamp
* ‘global’ uses the last timestamp seen in the data
* Can also be a specific timestamp or integer, e.g. ‘2000-01-01’, 2000 or datetime(2000, 1, 1)
id_colstrunique_idColumn that identifies each serie.
time_colstrdsColumn that identifies each timestamp.
ReturnsDFTypeDataframe with gaps filled.

Now we need to decide how to fill the missing values in the target column. In this tutorial, we will use interpolation, but it is important to consider the specific context of your data when selecting a filling strategy. For example, if you are dealing with daily retail data, a missing value most likely indicates that there were no sales on that day, and you can fill it with zero. Conversely, if you are working with hourly temperature data, a missing value probably means that the sensor was not functioning, and you might prefer to use interpolation to fill the missing values.

train_df_complete['y'] = train_df_complete['y'].interpolate(method='linear', limit_direction='both')

train_df_complete.isna().sum() # check if there are any missing values
unique_id    0
ds           0
y            0
dtype: int64

Forecast with TimeGPT

We are now ready to use the forecast method from the NixtlaClient class. This method requires the following arguments:

  • df: The DataFrame containing the time series data

  • h: (int) The forecast horizon. In this case, it is 93 days.

  • model (str): The model to use. Default is timegpt-1, but since the forecast horizon exceeds the frequency of the data (daily), we will use timegpt-1-long-horizon. To learn more about this, please refer to the Forecasting on a Long Horizon tutorial.

fcst = nixtla_client.forecast(train_df_complete, h=len(test_df), model='timegpt-1-long-horizon')
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Inferred freq: D
INFO:nixtla.nixtla_client:Preprocessing dataframes...
INFO:nixtla.nixtla_client:Querying model metadata...
WARNING:nixtla.nixtla_client:The specified horizon "h" exceeds the model horizon, this may lead to less accurate forecasts. Please consider using a smaller horizon.
INFO:nixtla.nixtla_client:Restricting input...
INFO:nixtla.nixtla_client:Calling Forecast Endpoint...

📘

Available models in Azure AI

If you are using an Azure AI endpoint, please be sure to set model="azureai":

nixtla_client.forecast(..., model="azureai")

For the public API, we support two models: timegpt-1 and timegpt-1-long-horizon.

By default, timegpt-1 is used. Please see this tutorial on how and when to use timegpt-1-long-horizon.

We can use the plot method to visualize the TimeGPT forecast and the test set.

nixtla_client.plot(test_df, fcst)

Next, we will use the evaluate function from utilsforecast to compute the Mean Average Error (MAE) of the TimeGPT forecast. Before proceeding, we need to convert the dates in the forecast to timestamps so we can merge them with the test set.

The evaluate function requires the following arguments:

  • df: The DataFrame containing the forecast and the actual values (in the y column).

  • metrics (list): The metrics to be computed.

from utilsforecast.evaluation import evaluate 
from utilsforecast.losses import mae 
fcst['ds'] = pd.to_datetime(fcst['ds'])

result = test_df.merge(fcst, on=['ds', 'unique_id'], how='left')
result.head()
unique_iddsyTimeGPT
0id12022-06-301346813357.357
1id12022-07-011293212390.052
2id12022-07-0299189778.649
3id12022-07-0389678846.637
4id12022-07-041286911589.071
evaluate(result, metrics=[mae])
unique_idmetricTimeGPT
0id1mae1824.693059

Important Considerations

The key takeaway from this tutorial is that TimeGPT requires time series data without missing values. This means that:

  1. Given the frequency of the data, the timestamps must be continuous, with no gaps between the start and end dates.

  2. The data must not contain missing values (NaNs).

We also showed that utilsforecast provides a convenient function to fill missing dates and that you need to decide how to address the missing values. This decision depends on the context of your data, so be mindful when selecting a filling strategy, and choose the one you think best reflects reality.

Finally, we also demonstrated that utilsforecast provides a function to evaluate the TimeGPT forecast using common accuracy metrics.

References