Missing Values
TimeGPT requires time series data that doesn’t have any missing values. It is possible to have multiple series that begin and end on different dates, but it is essential that each series contains uninterrupted data for its given time frame.
In this tutorial, we will show you how to deal with missing values in TimeGPT.
This work is based on skforecast’s Forecasting Time Series with Missing Values tutorial.
Load Data
We will first load the data using pandas. This dataset represents the daily number of bike rentals in a city. The column names are in Spanish, so we will rename them to ds for the dates and y for the number of bike rentals.
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/JoaquinAmatRodrigo/Estadistica-machine-learning-python/master/data/usuarios_diarios_bicimad.csv')
df = df[['fecha', 'Usos bicis total día']] # select date and target variable
df.rename(columns={'fecha': 'ds', 'Usos bicis total día': 'y'}, inplace=True)
df.head()
| | ds | y |
|---|---|---|
| 0 | 2014-06-23 | 99 |
| 1 | 2014-06-24 | 72 |
| 2 | 2014-06-25 | 119 |
| 3 | 2014-06-26 | 135 |
| 4 | 2014-06-27 | 149 |
For convenience, we will convert the dates to timestamps and assign a unique id to the series. Although we only have one series in this example, when dealing with multiple series, it is necessary to assign a unique id to each one.
df['ds'] = pd.to_datetime(df['ds'])
df['unique_id'] = 'id1'
df = df[['unique_id', 'ds', 'y']]
Now we will split the data into a training set and a test set. We will use the last 93 days as the test set.
train_df = df[:-93]
test_df = df[-93:]
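The positional slice above works because the data holds a single series. With several series in long format, one possible approach is to hold out the last rows of each series with a grouped tail. This is a sketch that is not part of the original tutorial; the toy data and the name horizon are illustrative assumptions.

```python
import pandas as pd

# Toy long-format data with two series of different lengths (illustrative values).
toy = pd.DataFrame({
    "unique_id": ["a"] * 5 + ["b"] * 4,
    "ds": list(pd.date_range("2024-01-01", periods=5, freq="D"))
          + list(pd.date_range("2024-01-03", periods=4, freq="D")),
    "y": range(9),
})

horizon = 2  # hypothetical number of test rows per series

# Sort chronologically within each series, then hold out the last `horizon` rows.
toy = toy.sort_values(["unique_id", "ds"])
toy_test = toy.groupby("unique_id").tail(horizon)
toy_train = toy.drop(toy_test.index)

print(toy_train.groupby("unique_id").size().to_dict())
print(toy_test.groupby("unique_id").size().to_dict())
```

Each series keeps its own most recent observations in the test set, even when the series end on different dates.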
We will now introduce some missing values in the training set to demonstrate how to deal with them. This will be done as in the skforecast tutorial.
mask = ~((train_df['ds'] >= '2020-09-01') & (train_df['ds'] <= '2020-10-10')) & ~((train_df['ds'] >= '2020-11-08') & (train_df['ds'] <= '2020-12-15'))
train_df_gaps = train_df[mask]
Get Started with TimeGPT
Before proceeding, we will instantiate the NixtlaClient class, which provides access to all the methods from TimeGPT. To do this, you will need a Nixtla API key.
from nixtla import NixtlaClient
nixtla_client = NixtlaClient(
    # defaults to os.environ.get("NIXTLA_API_KEY")
    api_key='my_api_key_provided_by_nixtla'
)
Use an Azure AI endpoint
To use an Azure AI endpoint, set the base_url argument:
nixtla_client = NixtlaClient(base_url="your azure ai endpoint", api_key="your api_key")
To learn more about how to set up your API key, please refer to the Setting Up Your API Key tutorial.
Visualize Data
We can visualize the data using the plot method from the NixtlaClient class. This method has an engine argument that allows you to choose between different plotting libraries. The default is matplotlib, but you can also use plotly for interactive plots.
nixtla_client.plot(train_df_gaps)
Note that there are two gaps in the data: from September 1, 2020, to October 10, 2020, and from November 8, 2020, to December 15, 2020. To better visualize these gaps, you can use the max_insample_length argument of the plot method or you can simply zoom in on the plot.
nixtla_client.plot(train_df_gaps, max_insample_length=800)
Additionally, notice a period from March 16, 2020, to April 21, 2020, where the data shows zero rentals. These are not missing values, but actual zeros corresponding to the COVID-19 lockdown in the city.
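Gaps can also be located programmatically rather than visually. The following is a minimal pandas sketch on toy data (the dates and values are illustrative, not taken from the bike rental dataset): a gap appears as a jump of more than one day between consecutive timestamps, whereas recorded zeros are ordinary rows.

```python
import pandas as pd

# Toy daily series: zeros are recorded on Jan 3-4, while Jan 6-8 are absent.
dates = pd.to_datetime(
    ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04",
     "2024-01-05", "2024-01-09", "2024-01-10"]
)
toy = pd.DataFrame({"ds": dates, "y": [5, 7, 0, 0, 6, 8, 9]})

# A gap shows up as a step larger than one day between consecutive rows.
step = toy["ds"].diff()
after_gap = step > pd.Timedelta(days=1)

# First and last absent date of each gap.
gaps = pd.DataFrame({
    "gap_start": toy["ds"].shift(1)[after_gap] + pd.Timedelta(days=1),
    "gap_end": toy["ds"][after_gap] - pd.Timedelta(days=1),
})
print(gaps)

# Recorded zeros are regular observations and are not flagged as gaps.
print(toy[toy["y"] == 0])
```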
Fill Missing Values
Before using TimeGPT, we need to ensure that:
- All timestamps from the start date to the end date are present in the data.
- The target column contains no missing values.
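The two conditions above can be checked before calling the model. Below is a minimal pandas sketch; the helper name check_series and the toy data are hypothetical, and the column names follow the unique_id, ds, and y convention used in this tutorial.

```python
import pandas as pd

def check_series(df, freq="D", id_col="unique_id", time_col="ds", target_col="y"):
    """Return per-series counts of missing timestamps and missing target values."""
    rows = []
    for uid, grp in df.groupby(id_col):
        # Every timestamp expected between the series' own first and last date.
        expected = pd.date_range(grp[time_col].min(), grp[time_col].max(), freq=freq)
        present = pd.DatetimeIndex(grp[time_col])
        rows.append({
            id_col: uid,
            "missing_timestamps": len(expected.difference(present)),
            "missing_values": int(grp[target_col].isna().sum()),
        })
    return pd.DataFrame(rows)

# Toy series: Jan 3-4 are absent, and Jan 2 has a NaN target.
toy = pd.DataFrame({
    "unique_id": "id1",
    "ds": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-05", "2024-01-06"]),
    "y": [1.0, None, 3.0, 4.0],
})
report = check_series(toy)
print(report)
```

A series is ready only when both counts are zero.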
To address the first issue, we will use the fill_gaps function from utilsforecast, a Python package from Nixtla that provides essential utilities for time series forecasting, such as functions for data preprocessing, plotting, and evaluation.
The fill_gaps function will fill in the missing dates in the data. To do this, it requires the following arguments:
- df: The DataFrame containing the time series data.
- freq (str or int): The frequency of the data.
from utilsforecast.preprocessing import fill_gaps
print('Number of rows before filling gaps:', len(train_df_gaps))
train_df_complete = fill_gaps(train_df_gaps, freq='D')
print('Number of rows after filling gaps:', len(train_df_complete))
Number of rows before filling gaps: 2851
Number of rows after filling gaps: 2929
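As a sanity check on the row counts printed above, the number of added rows should equal the number of days removed earlier. The sketch below assumes the original data had one row per day inside both removed ranges.

```python
import pandas as pd

# The two date ranges removed from the training set (both ends inclusive).
gap_1 = pd.date_range("2020-09-01", "2020-10-10", freq="D")
gap_2 = pd.date_range("2020-11-08", "2020-12-15", freq="D")

removed_days = len(gap_1) + len(gap_2)
added_rows = 2929 - 2851  # row counts printed above

print(len(gap_1), len(gap_2), removed_days, added_rows)
```

The two gaps cover 40 and 38 days, so the 78 rows added by fill_gaps account for exactly the removed dates.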
NOTE:
In this tutorial, the data contains only one time series. However, TimeGPT supports passing multiple series to the model. In this case, none of the time series can have missing values between its own earliest timestamp and its own latest timestamp. If a series does have missing values, you must decide how to fill its gaps. The fill_gaps function provides two additional arguments to assist with this, namely start and end (refer to the documentation for complete details).
fill_gaps
fill_gaps (df:~DFType, freq:Union[str,int],
           start:Union[str,int,datetime.date,datetime.datetime]='per_serie',
           end:Union[str,int,datetime.date,datetime.datetime]='global',
           id_col:str='unique_id', time_col:str='ds')
Enforce start and end datetimes for dataframe.
| | Type | Default | Details |
|---|---|---|---|
| df | DFType | | Input data |
| freq | Union | | Series’ frequency |
| start | Union | per_serie | Initial timestamp for the series. * ‘per_serie’ uses each serie’s first timestamp * ‘global’ uses the first timestamp seen in the data * Can also be a specific timestamp or integer, e.g. ‘2000-01-01’, 2000 or datetime(2000, 1, 1) |
| end | Union | global | Final timestamp for the series. * ‘per_serie’ uses each serie’s last timestamp * ‘global’ uses the last timestamp seen in the data * Can also be a specific timestamp or integer, e.g. ‘2000-01-01’, 2000 or datetime(2000, 1, 1) |
| id_col | str | unique_id | Column that identifies each serie. |
| time_col | str | ds | Column that identifies each timestamp. |
| Returns | DFType | | Dataframe with gaps filled. |
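To illustrate how the 'per_serie' and 'global' boundaries differ when several series are present, here is a pandas-only approximation. It is not the utilsforecast implementation: the helper fill_dates is hypothetical, it handles only the two string options (not specific timestamps), and the toy data is illustrative.

```python
import pandas as pd

def fill_dates(df, freq="D", start="per_serie", end="global"):
    """Pandas-only illustration: add rows (with NaN targets) for absent timestamps."""
    out = []
    for uid, grp in df.groupby("unique_id"):
        # 'per_serie' uses the series' own boundary, otherwise the global one.
        first = grp["ds"].min() if start == "per_serie" else df["ds"].min()
        last = grp["ds"].max() if end == "per_serie" else df["ds"].max()
        full = pd.DataFrame({"unique_id": uid, "ds": pd.date_range(first, last, freq=freq)})
        out.append(full.merge(grp, on=["unique_id", "ds"], how="left"))
    return pd.concat(out, ignore_index=True)

# Series 'a' spans Jan 1-4 (Jan 2 absent); series 'b' spans Jan 2-6 (Jan 3-5 absent).
toy = pd.DataFrame({
    "unique_id": ["a", "a", "a", "b", "b"],
    "ds": pd.to_datetime(["2024-01-01", "2024-01-03", "2024-01-04",
                          "2024-01-02", "2024-01-06"]),
    "y": [1.0, 2.0, 3.0, 4.0, 5.0],
})

per_serie = fill_dates(toy, start="per_serie", end="per_serie")
default = fill_dates(toy)  # start='per_serie', end='global'
print(per_serie.groupby("unique_id").size().to_dict())
print(default.groupby("unique_id").size().to_dict())
```

With the default end='global', the shorter series is extended to the last timestamp seen in the data, which adds trailing rows with missing targets. When series legitimately end on different dates, end='per_serie' avoids creating those extra rows.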
Now we need to decide how to fill the missing values in the target column. In this tutorial, we will use interpolation, but it is important to consider the specific context of your data when selecting a filling strategy. For example, if you are dealing with daily retail data, a missing value most likely indicates that there were no sales on that day, and you can fill it with zero. Conversely, if you are working with hourly temperature data, a missing value probably means that the sensor was not functioning, and you might prefer to use interpolation to fill the missing values.
train_df_complete['y'] = train_df_complete['y'].interpolate(method='linear', limit_direction='both')
train_df_complete.isna().sum() # check if there are any missing values
unique_id 0
ds 0
y 0
dtype: int64
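To make the contrast between the two filling strategies mentioned above concrete, the following sketch applies zero filling and linear interpolation to the same toy series. The values are illustrative and unrelated to the bike rental data.

```python
import pandas as pd

# Toy target with two consecutive missing observations (illustrative values).
y = pd.Series(
    [10.0, None, None, 40.0],
    index=pd.date_range("2024-01-01", periods=4, freq="D"),
)

comparison = pd.DataFrame({
    "original": y,
    "zero_fill": y.fillna(0),                  # e.g. days with no sales
    "linear": y.interpolate(method="linear"),  # straight line between neighbors
})
print(comparison)
```

Zero filling introduces a sharp drop, while interpolation assumes a smooth transition between the observed neighbors. Which one is appropriate depends on what a missing value means in your data.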
Forecast with TimeGPT
We are now ready to use the forecast method from the NixtlaClient class. This method requires the following arguments:
- df: The DataFrame containing the time series data.
- h (int): The forecast horizon. In this case, it is 93 days.
- model (str): The model to use. The default is timegpt-1, but since the forecast horizon (93 days) is long relative to the daily frequency of the data, we will use timegpt-1-long-horizon. To learn more about this, please refer to the Forecasting on a Long Horizon tutorial.
fcst = nixtla_client.forecast(train_df_complete, h=len(test_df), model='timegpt-1-long-horizon')
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Inferred freq: D
INFO:nixtla.nixtla_client:Preprocessing dataframes...
INFO:nixtla.nixtla_client:Querying model metadata...
WARNING:nixtla.nixtla_client:The specified horizon "h" exceeds the model horizon, this may lead to less accurate forecasts. Please consider using a smaller horizon.
INFO:nixtla.nixtla_client:Restricting input...
INFO:nixtla.nixtla_client:Calling Forecast Endpoint...
Available models in Azure AI
If you are using an Azure AI endpoint, please be sure to set model="azureai":
nixtla_client.forecast(..., model="azureai")
For the public API, we support two models: timegpt-1 and timegpt-1-long-horizon.
By default, timegpt-1 is used. Please see this tutorial on how and when to use timegpt-1-long-horizon.
We can use the plot method to visualize the TimeGPT forecast and the test set.
nixtla_client.plot(test_df, fcst)
Next, we will use the evaluate function from utilsforecast to compute the Mean Absolute Error (MAE) of the TimeGPT forecast. Before proceeding, we need to convert the dates in the forecast to timestamps so we can merge them with the test set.
The evaluate function requires the following arguments:
- df: The DataFrame containing the forecast and the actual values (in the y column).
- metrics (list): The metrics to be computed.
from utilsforecast.evaluation import evaluate
from utilsforecast.losses import mae
fcst['ds'] = pd.to_datetime(fcst['ds'])
result = test_df.merge(fcst, on=['ds', 'unique_id'], how='left')
result.head()
| | unique_id | ds | y | TimeGPT |
|---|---|---|---|---|
| 0 | id1 | 2022-06-30 | 13468 | 13357.357 |
| 1 | id1 | 2022-07-01 | 12932 | 12390.052 |
| 2 | id1 | 2022-07-02 | 9918 | 9778.649 |
| 3 | id1 | 2022-07-03 | 8967 | 8846.637 |
| 4 | id1 | 2022-07-04 | 12869 | 11589.071 |
evaluate(result, metrics=[mae])
| | unique_id | metric | TimeGPT |
|---|---|---|---|
| 0 | id1 | mae | 1824.693059 |
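The MAE is the mean of the absolute differences between the actual values and the forecast. As a minimal illustration of the formula, the sketch below computes it by hand with pandas on the five rows of the merged result shown earlier. It is not a replacement for the evaluate function.

```python
import pandas as pd

# First five rows of the merged result shown earlier.
sample = pd.DataFrame({
    "y": [13468, 12932, 9918, 8967, 12869],
    "TimeGPT": [13357.357, 12390.052, 9778.649, 8846.637, 11589.071],
})

# MAE = mean(|y - y_hat|)
mae_sample = (sample["y"] - sample["TimeGPT"]).abs().mean()
print(round(mae_sample, 3))
```

This value covers only the first five days, so it differs from the MAE of 1824.69 computed over the full 93-day test set.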
Important Considerations
The key takeaway from this tutorial is that TimeGPT requires time series data without missing values. This means that:
- Given the frequency of the data, the timestamps must be continuous, with no gaps between the start and end dates.
- The data must not contain missing values (NaNs).
We also showed that utilsforecast provides a convenient function to fill missing dates and that you need to decide how to address the missing values. This decision depends on the context of your data, so be mindful when selecting a filling strategy, and choose the one you think best reflects reality.
Finally, we also demonstrated that utilsforecast provides a function to evaluate the TimeGPT forecast using common accuracy metrics.