Forecasting web traffic

Our task is to forecast the next 7 days of daily visits to the website cienciadedatos.net.

In this tutorial we will show:

  • How to load time series data to be used for forecasting with TimeGPT

  • How to create cross-validated forecasts with TimeGPT

This tutorial is adapted from “Forecasting web traffic with machine learning and Python” by Joaquín Amat Rodrigo and Javier Escobar Ortiz. We will show you how to:

  • achieve almost 10% better forecasting results;

  • use significantly fewer lines of code;

  • run everything in a fraction of the time needed for the original tutorial.

1. Import packages

First, we import the required packages and initialize the Nixtla client.

import pandas as pd
from nixtla import NixtlaClient
nixtla_client = NixtlaClient(
    # defaults to os.environ.get("NIXTLA_API_KEY")
    api_key = 'my_api_key_provided_by_nixtla'
)
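Hardcoding the key as shown above is convenient for a demo but easy to leak. A minimal sketch of reading it from the environment instead follows; the helper name `resolve_api_key` is hypothetical and not part of the nixtla package, and only the `NIXTLA_API_KEY` variable name comes from the comment in the snippet above.

```python
import os


def resolve_api_key(env_var: str = "NIXTLA_API_KEY") -> str:
    """Return the API key stored in `env_var`, or raise if it is missing.

    Hypothetical helper for illustration; it is not provided by nixtla.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key


# Usage (requires the nixtla package and a valid key in the environment):
# nixtla_client = NixtlaClient(api_key=resolve_api_key())
```

Failing early with a clear message tends to be easier to debug than an authentication error returned by the first API call.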

👍

Use an Azure AI endpoint

To use an Azure AI endpoint, remember to also set the base_url argument:

nixtla_client = NixtlaClient(base_url="your azure ai endpoint", api_key="your api_key")

2. Load data

We load the website visit data and put it in the format that TimeGPT expects. In this case, we only need to add an identifier column for the time series, unique_id, which we set to daily_visits.

url = ('https://raw.githubusercontent.com/JoaquinAmatRodrigo/Estadistica-machine-learning-python/' +
       'master/data/visitas_por_dia_web_cienciadedatos.csv')
df = pd.read_csv(url, sep=',', parse_dates=[0], date_format='%d/%m/%y')
df['unique_id'] = 'daily_visits'

df.head(10)
         date  users     unique_id
0  2020-07-01   2324  daily_visits
1  2020-07-02   2201  daily_visits
2  2020-07-03   2146  daily_visits
3  2020-07-04   1666  daily_visits
4  2020-07-05   1433  daily_visits
5  2020-07-06   2195  daily_visits
6  2020-07-07   2240  daily_visits
7  2020-07-08   2295  daily_visits
8  2020-07-09   2279  daily_visits
9  2020-07-10   2155  daily_visits

That’s it! No more preprocessing is necessary.
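Before sending the data to the API, it can be worth confirming that the frame really has the expected shape. The sketch below is one possible check, not something the tutorial or the nixtla client requires: the function name `check_daily_frame` is hypothetical, and the toy frame only mirrors the first rows printed above so that the example runs without downloading anything.

```python
import pandas as pd


def check_daily_frame(df, time_col="date", target_col="users", id_col="unique_id"):
    """Return a list of problems found; an empty list means the frame looks fine."""
    problems = [f"missing column: {c}" for c in (id_col, time_col, target_col)
                if c not in df.columns]
    if problems:
        return problems
    if not pd.api.types.is_datetime64_any_dtype(df[time_col]):
        return [f"{time_col} is not a datetime column"]
    if df[target_col].isna().any():
        problems.append(f"{target_col} contains missing values")
    for uid, group in df.groupby(id_col):
        dates = group[time_col]
        if dates.duplicated().any():
            problems.append(f"{uid}: duplicated dates")
        # Every calendar day between the first and last date should be present
        expected = pd.date_range(dates.min(), dates.max(), freq="D")
        missing = expected.difference(pd.DatetimeIndex(dates))
        if len(missing) > 0:
            problems.append(f"{uid}: {len(missing)} missing day(s)")
    return problems


# Toy frame mirroring the first rows shown above (not the full dataset)
sample = pd.DataFrame({
    "date": pd.date_range("2020-07-01", periods=5, freq="D"),
    "users": [2324, 2201, 2146, 1666, 1433],
    "unique_id": "daily_visits",
})
print(check_daily_frame(sample))  # prints []
```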

3. Cross-validation with TimeGPT

We can perform cross-validation on our data as follows:

timegpt_cv_df = nixtla_client.cross_validation(
    df, 
    h=7, 
    n_windows=8, 
    time_col='date', 
    target_col='users', 
    freq='D',
    level=[80, 90, 99.5]
)
timegpt_cv_df.head()
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Preprocessing dataframes...
INFO:nixtla.nixtla_client:Restricting input...
INFO:nixtla.nixtla_client:Calling Forecast Endpoint...
      unique_id        date      cutoff  users      TimeGPT  TimeGPT-lo-99.5  TimeGPT-lo-90  TimeGPT-lo-80  TimeGPT-hi-80  TimeGPT-hi-90  TimeGPT-hi-99.5
0  daily_visits  2021-07-01  2021-06-30   3123  3310.908447      3041.925497    3048.363220    3082.721924    3539.094971    3573.453674      3579.891397
1  daily_visits  2021-07-02  2021-06-30   2870  3090.971680      2793.535905    2838.480298    2853.750488    3328.192871    3343.463062      3388.407455
2  daily_visits  2021-07-03  2021-06-30   2020  2346.991455      2043.731296    2150.005078    2171.187012    2522.795898    2543.977832      2650.251614
3  daily_visits  2021-07-04  2021-06-30   1828  2182.191895      1836.848173    1897.684900    1929.914575    2434.469214    2466.698889      2527.535616
4  daily_visits  2021-07-05  2021-06-30   2722  3082.715088      2736.008055    2746.997034    2791.375342    3374.054834    3418.433142      3429.422121
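The cutoff column marks the last observed date before each forecast window. To make the window layout concrete, here is a small sketch that lists the cutoffs for h=7 and n_windows=8. It rests on assumptions that the text does not confirm: the windows do not overlap (the step between windows equals h), and the series ends on 2021-08-25. The function name `rolling_cutoffs` is hypothetical. Under these assumptions the first cutoff matches the 2021-06-30 shown in the output above.

```python
from datetime import date, timedelta


def rolling_cutoffs(last_date, h, n_windows, step_size=None):
    """Cutoff dates (last training day) for each window, earliest first."""
    step = h if step_size is None else step_size  # ASSUMPTION: step defaults to h
    cutoffs = []
    for i in range(n_windows):
        # Window i (0 = earliest) starts forecasting right after its cutoff
        offset = h + (n_windows - 1 - i) * step
        cutoffs.append(last_date - timedelta(days=offset))
    return cutoffs


# ASSUMPTION: the series ends on 2021-08-25
cutoffs = rolling_cutoffs(date(2021, 8, 25), h=7, n_windows=8)
for c in cutoffs:
    print(c, "->", c + timedelta(days=1), "to", c + timedelta(days=7))
```

With 8 windows of 7 days each, the backtest covers 56 days in total.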

📘

Available models in Azure AI

If you are using an Azure AI endpoint, please be sure to set model="azureai":

nixtla_client.cross_validation(..., model="azureai")

For the public API, we support two models: timegpt-1 and timegpt-1-long-horizon.

By default, timegpt-1 is used. Please see this tutorial on how and when to use timegpt-1-long-horizon.

Here, we have performed a rolling cross-validation of 8 folds. Let’s plot the cross-validated forecasts including the prediction intervals:

nixtla_client.plot(
    df, 
    timegpt_cv_df.drop(columns=['cutoff', 'users']), 
    time_col='date',
    target_col='users',
    max_insample_length=90, 
    level=[80, 90, 99.5]
)

This looks reasonable, and is very comparable to the results obtained in the original tutorial.
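Beyond eyeballing the plot, one way to check the prediction intervals is to measure how often the actual values fall inside them. The sketch below is an illustration rather than part of the tutorial: the function name `interval_coverage` is hypothetical, and the data are only the first five rows of the cross-validation output shown earlier, which is far too few to judge calibration. On the real data you would pass the full timegpt_cv_df.

```python
import pandas as pd


def interval_coverage(cv_df, levels, target_col="users", model="TimeGPT"):
    """Share of actual values that fall inside each prediction interval."""
    coverage = {}
    for level in levels:
        lo = cv_df[f"{model}-lo-{level}"]
        hi = cv_df[f"{model}-hi-{level}"]
        inside = (cv_df[target_col] >= lo) & (cv_df[target_col] <= hi)
        coverage[level] = float(inside.mean())
    return coverage


# First five rows of the cross-validation output (80% interval only)
rows = pd.DataFrame({
    "users": [3123, 2870, 2020, 1828, 2722],
    "TimeGPT-lo-80": [3082.721924, 2853.750488, 2171.187012, 1929.914575, 2791.375342],
    "TimeGPT-hi-80": [3539.094971, 3328.192871, 2522.795898, 2434.469214, 3374.054834],
})
print(interval_coverage(rows, levels=[80]))
```

A well-calibrated 80% interval would contain roughly 80% of the actual values over a sufficiently long backtest.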

Let’s check the Mean Absolute Error of our cross-validation:

from utilsforecast.losses import mae
mae_timegpt = mae(
    df=timegpt_cv_df.drop(columns=['cutoff']),
    models=['TimeGPT'],
    target_col='users'
)

mae_timegpt
      unique_id     TimeGPT
0  daily_visits  167.691711

The MAE of our backtest is 167.69, which is lower than the MAE of the fully customized pipeline in the original tutorial.
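For readers who want to see what the metric does, the Mean Absolute Error is the average of the absolute differences between actual and forecast values. The sketch below computes it by hand on the first five rows of the cross-validation output shown earlier. It is an illustration only: because it uses 5 of the 56 backtest rows, the result differs from the 167.69 reported for the full backtest.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |actual - forecast| over all observations."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# 'users' and 'TimeGPT' columns from the first five cross-validation rows
actual = [3123, 2870, 2020, 1828, 2722]
forecast = [3310.908447, 3090.971680, 2346.991455, 2182.191895, 3082.715088]

print(round(mean_absolute_error(actual, forecast), 2))  # prints 290.16
```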

Exogenous variables

Now let’s add some exogenous variables to see if we can improve the forecasting performance further.

We will add weekday indicators, which we will extract from the date column.

# We have 7 days, for each day a separate column denoting 1/0
for i in range(7):
    df[f'week_day_{i + 1}'] = 1 * (df['date'].dt.weekday == i)

df.head(10)
         date  users     unique_id  week_day_1  week_day_2  week_day_3  week_day_4  week_day_5  week_day_6  week_day_7
0  2020-07-01   2324  daily_visits           0           0           1           0           0           0           0
1  2020-07-02   2201  daily_visits           0           0           0           1           0           0           0
2  2020-07-03   2146  daily_visits           0           0           0           0           1           0           0
3  2020-07-04   1666  daily_visits           0           0           0           0           0           1           0
4  2020-07-05   1433  daily_visits           0           0           0           0           0           0           1
5  2020-07-06   2195  daily_visits           1           0           0           0           0           0           0
6  2020-07-07   2240  daily_visits           0           1           0           0           0           0           0
7  2020-07-08   2295  daily_visits           0           0           1           0           0           0           0
8  2020-07-09   2279  daily_visits           0           0           0           1           0           0           0
9  2020-07-10   2155  daily_visits           0           0           0           0           1           0           0
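Since pandas numbers weekdays from Monday = 0, week_day_1 corresponds to Monday and week_day_7 to Sunday. A quick sanity check on a toy frame makes the mapping explicit; the frame `df_demo` below is a stand-in built only from the dates shown above, not the real dataset.

```python
import pandas as pd

# Toy frame covering the same ten dates as the output above
df_demo = pd.DataFrame({"date": pd.date_range("2020-07-01", periods=10, freq="D")})
for i in range(7):
    # dt.weekday returns 0 for Monday through 6 for Sunday
    df_demo[f"week_day_{i + 1}"] = (df_demo["date"].dt.weekday == i).astype(int)

indicator_cols = [f"week_day_{i + 1}" for i in range(7)]

# Exactly one indicator should be set on every row
assert (df_demo[indicator_cols].sum(axis=1) == 1).all()

# 2020-07-01 was a Wednesday, so only week_day_3 is set
print(df_demo.loc[0, indicator_cols].tolist())  # prints [0, 0, 1, 0, 0, 0, 0]
```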

Let’s rerun the cross-validation procedure with the added exogenous variables.

timegpt_cv_df_with_ex = nixtla_client.cross_validation(
    df, 
    h=7, 
    n_windows=8, 
    time_col='date', 
    target_col='users', 
    freq='D',
    level=[80, 90, 99.5]
)
timegpt_cv_df_with_ex.head()
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Preprocessing dataframes...
INFO:nixtla.nixtla_client:Using the following exogenous variables: week_day_1, week_day_2, week_day_3, week_day_4, week_day_5, week_day_6, week_day_7
INFO:nixtla.nixtla_client:Calling Forecast Endpoint...
      unique_id        date      cutoff  users      TimeGPT  TimeGPT-lo-99.5  TimeGPT-lo-90  TimeGPT-lo-80  TimeGPT-hi-80  TimeGPT-hi-90  TimeGPT-hi-99.5
0  daily_visits  2021-07-01  2021-06-30   3123  3314.773743      2793.566942    3043.304261    3085.668122    3543.879364    3586.243226      3835.980544
1  daily_visits  2021-07-02  2021-06-30   2870  3093.066529      2139.727892    2725.964112    2779.082154    3407.050904    3460.168946      4046.405166
2  daily_visits  2021-07-03  2021-06-30   2020  2347.973573      1386.090529    1915.487550    1973.679628    2722.267519    2780.459596      3309.856618
3  daily_visits  2021-07-04  2021-06-30   1828  2182.467408      1003.677454    1681.246491    1874.572327    2490.362488    2683.688324      3361.257361
4  daily_visits  2021-07-05  2021-06-30   2722  3083.629453      1257.248435    2220.430357    2556.408628    3610.850279    3946.828550      4910.010472

Let’s plot our forecasts again and calculate our error.

nixtla_client.plot(
    df, 
    timegpt_cv_df_with_ex.drop(columns=['cutoff', 'users']), 
    time_col='date',
    target_col='users',
    max_insample_length=90, 
    level=[80, 90, 99.5]
)

mae_timegpt_with_exogenous = mae(
    df=timegpt_cv_df_with_ex.drop(columns=['cutoff']),
    models=['TimeGPT'],
    target_col='users'
)

mae_timegpt_with_exogenous
      unique_id    TimeGPT
0  daily_visits  167.22857

To conclude, we obtain the following forecast results in this notebook:

mae_timegpt['Exogenous features'] = False
mae_timegpt_with_exogenous['Exogenous features'] = True

df_results = pd.concat([mae_timegpt, mae_timegpt_with_exogenous])
df_results = df_results.rename(columns={'TimeGPT':'MAE backtest'})
df_results = df_results.drop(columns=['unique_id'])
df_results['model'] = 'TimeGPT'

df_results[['model', 'Exogenous features', 'MAE backtest']]
     model  Exogenous features  MAE backtest
0  TimeGPT               False    167.691711
0  TimeGPT                True    167.228570
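To put the two rows in perspective, the relative change in MAE can be computed directly from the numbers in the table. The helper name `relative_improvement` below is hypothetical; the two MAE values are copied from the results above.

```python
def relative_improvement(baseline, candidate):
    """Percentage by which `candidate` lowers the error relative to `baseline`."""
    return (baseline - candidate) / baseline * 100


mae_without_exogenous = 167.691711  # from the results table
mae_with_exogenous = 167.228570     # from the results table

pct = relative_improvement(mae_without_exogenous, mae_with_exogenous)
print(f"{pct:.2f}%")  # prints 0.28%
```

On this backtest, the weekday indicators lower the MAE only marginally.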

We’ve shown how to forecast daily visits of a website. We achieved almost 10% better forecasting results than the original tutorial, using significantly fewer lines of code, in a fraction of the time required to run everything.

Did you notice how little effort that took? Here is what you did not have to do:

  • Elaborate data preprocessing - a table with the time series is sufficient
  • Creating validation and test sets - TimeGPT handles the cross-validation in a single function call
  • Choosing and testing different models - it’s just a single call to TimeGPT
  • Hyperparameter tuning - not necessary

Happy forecasting!