Forecasting web traffic
Our task is to forecast the next 7 days of daily visits to the website cienciadedatos.net.
In this tutorial we will show:
-
How to load time series data to be used for forecasting with TimeGPT
-
How to create cross-validated forecasts with TimeGPT
This tutorial is an adaptation from Joaquín Amat Rodrigo, Javier Escobar Ortiz, “Forecasting web traffic with machine learning and Python”. We will show you:
-
how you can achieve almost 10% better forecasting results;
-
using significantly less lines of code;
-
in a fraction of the time needed to run the original tutorial.
1. Import packages
First, we import the required packages and initialize the Nixtla client.
import pandas as pd
from nixtla import NixtlaClient
nixtla_client = NixtlaClient(
# defaults to os.environ.get("NIXTLA_API_KEY")
api_key = 'my_api_key_provided_by_nixtla'
)
Use an Azure AI endpoint
To use an Azure AI endpoint, remember to set also the
base_url
argument:
nixtla_client = NixtlaClient(base_url="you azure ai endpoint", api_key="your api_key")
2. Load data
We load the website visit data, and set it to the right format to use with TimeGPT. In this case, we only need to add an identifier column for the timeseries, which we will call daily_visits
.
url = ('https://raw.githubusercontent.com/JoaquinAmatRodrigo/Estadistica-machine-learning-python/' +
'master/data/visitas_por_dia_web_cienciadedatos.csv')
df = pd.read_csv(url, sep=',', parse_dates=[0], date_format='%d/%m/%y')
df['unique_id'] = 'daily_visits'
df.head(10)
date | users | unique_id | |
---|---|---|---|
0 | 2020-07-01 | 2324 | daily_visits |
1 | 2020-07-02 | 2201 | daily_visits |
2 | 2020-07-03 | 2146 | daily_visits |
3 | 2020-07-04 | 1666 | daily_visits |
4 | 2020-07-05 | 1433 | daily_visits |
5 | 2020-07-06 | 2195 | daily_visits |
6 | 2020-07-07 | 2240 | daily_visits |
7 | 2020-07-08 | 2295 | daily_visits |
8 | 2020-07-09 | 2279 | daily_visits |
9 | 2020-07-10 | 2155 | daily_visits |
That’s it! No more preprocessing is necessary.
3. Cross-validation with TimeGPT
We can perform cross-validation on our data as follows:
timegpt_cv_df = nixtla_client.cross_validation(
df,
h=7,
n_windows=8,
time_col='date',
target_col='users',
freq='D',
level=[80, 90, 99.5]
)
timegpt_cv_df.head()
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Preprocessing dataframes...
INFO:nixtla.nixtla_client:Restricting input...
INFO:nixtla.nixtla_client:Calling Forecast Endpoint...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Preprocessing dataframes...
INFO:nixtla.nixtla_client:Restricting input...
INFO:nixtla.nixtla_client:Calling Forecast Endpoint...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Preprocessing dataframes...
INFO:nixtla.nixtla_client:Restricting input...
INFO:nixtla.nixtla_client:Calling Forecast Endpoint...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Preprocessing dataframes...
INFO:nixtla.nixtla_client:Restricting input...
INFO:nixtla.nixtla_client:Calling Forecast Endpoint...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Preprocessing dataframes...
INFO:nixtla.nixtla_client:Restricting input...
INFO:nixtla.nixtla_client:Calling Forecast Endpoint...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Preprocessing dataframes...
INFO:nixtla.nixtla_client:Restricting input...
INFO:nixtla.nixtla_client:Calling Forecast Endpoint...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Preprocessing dataframes...
INFO:nixtla.nixtla_client:Restricting input...
INFO:nixtla.nixtla_client:Calling Forecast Endpoint...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Preprocessing dataframes...
INFO:nixtla.nixtla_client:Restricting input...
INFO:nixtla.nixtla_client:Calling Forecast Endpoint...
INFO:nixtla.nixtla_client:Validating inputs...
unique_id | date | cutoff | users | TimeGPT | TimeGPT-lo-99.5 | TimeGPT-lo-90 | TimeGPT-lo-80 | TimeGPT-hi-80 | TimeGPT-hi-90 | TimeGPT-hi-99.5 | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | daily_visits | 2021-07-01 | 2021-06-30 | 3123 | 3310.908447 | 3041.925497 | 3048.363220 | 3082.721924 | 3539.094971 | 3573.453674 | 3579.891397 |
1 | daily_visits | 2021-07-02 | 2021-06-30 | 2870 | 3090.971680 | 2793.535905 | 2838.480298 | 2853.750488 | 3328.192871 | 3343.463062 | 3388.407455 |
2 | daily_visits | 2021-07-03 | 2021-06-30 | 2020 | 2346.991455 | 2043.731296 | 2150.005078 | 2171.187012 | 2522.795898 | 2543.977832 | 2650.251614 |
3 | daily_visits | 2021-07-04 | 2021-06-30 | 1828 | 2182.191895 | 1836.848173 | 1897.684900 | 1929.914575 | 2434.469214 | 2466.698889 | 2527.535616 |
4 | daily_visits | 2021-07-05 | 2021-06-30 | 2722 | 3082.715088 | 2736.008055 | 2746.997034 | 2791.375342 | 3374.054834 | 3418.433142 | 3429.422121 |
Available models in Azure AI
If you are using an Azure AI endpoint, please be sure to set
model="azureai"
:
nixtla_client.cross_validation(..., model="azureai")
For the public API, we support two models:
timegpt-1
andtimegpt-1-long-horizon
.By default,
timegpt-1
is used. Please see this tutorial on how and when to usetimegpt-1-long-horizon
.
Here, we have performed a rolling cross-validation of 8 folds. Let’s plot the cross-validated forecasts including the prediction intervals:
nixtla_client.plot(
df,
timegpt_cv_df.drop(columns=['cutoff', 'users']),
time_col='date',
target_col='users',
max_insample_length=90,
level=[80, 90, 99.5]
)
This looks reasonable, and very comparable to the results obtained here.
Let’s check the Mean Absolute Error of our cross-validation:
from utilsforecast.losses import mae
mae_timegpt = mae(df = timegpt_cv_df.drop(columns=['cutoff']),
models=['TimeGPT'],
target_col='users')
mae_timegpt
unique_id | TimeGPT | |
---|---|---|
0 | daily_visits | 167.691711 |
The MAE of our backtest is 167.69
. Hence, not only did TimeGPT achieve a lower MAE compared to the fully customized pipeline here, the error of the forecast is also lower.
Exogenous variables
Now let’s add some exogenous variables to see if we can improve the forecasting performance further.
We will add weekday indicators, which we will extract from the date
column.
# We have 7 days, for each day a separate column denoting 1/0
for i in range(7):
df[f'week_day_{i + 1}'] = 1 * (df['date'].dt.weekday == i)
df.head(10)
date | users | unique_id | week_day_1 | week_day_2 | week_day_3 | week_day_4 | week_day_5 | week_day_6 | week_day_7 | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 2020-07-01 | 2324 | daily_visits | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
1 | 2020-07-02 | 2201 | daily_visits | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
2 | 2020-07-03 | 2146 | daily_visits | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
3 | 2020-07-04 | 1666 | daily_visits | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
4 | 2020-07-05 | 1433 | daily_visits | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
5 | 2020-07-06 | 2195 | daily_visits | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
6 | 2020-07-07 | 2240 | daily_visits | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
7 | 2020-07-08 | 2295 | daily_visits | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
8 | 2020-07-09 | 2279 | daily_visits | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
9 | 2020-07-10 | 2155 | daily_visits | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
Let’s rerun the cross-validation procedure with the added exogenous variables.
timegpt_cv_df_with_ex = nixtla_client.cross_validation(
df,
h=7,
n_windows=8,
time_col='date',
target_col='users',
freq='D',
level=[80, 90, 99.5]
)
timegpt_cv_df_with_ex.head()
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Preprocessing dataframes...
INFO:nixtla.nixtla_client:Using the following exogenous variables: week_day_1, week_day_2, week_day_3, week_day_4, week_day_5, week_day_6, week_day_7
INFO:nixtla.nixtla_client:Calling Forecast Endpoint...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Preprocessing dataframes...
INFO:nixtla.nixtla_client:Using the following exogenous variables: week_day_1, week_day_2, week_day_3, week_day_4, week_day_5, week_day_6, week_day_7
INFO:nixtla.nixtla_client:Calling Forecast Endpoint...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Preprocessing dataframes...
INFO:nixtla.nixtla_client:Using the following exogenous variables: week_day_1, week_day_2, week_day_3, week_day_4, week_day_5, week_day_6, week_day_7
INFO:nixtla.nixtla_client:Calling Forecast Endpoint...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Preprocessing dataframes...
INFO:nixtla.nixtla_client:Using the following exogenous variables: week_day_1, week_day_2, week_day_3, week_day_4, week_day_5, week_day_6, week_day_7
INFO:nixtla.nixtla_client:Calling Forecast Endpoint...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Preprocessing dataframes...
INFO:nixtla.nixtla_client:Using the following exogenous variables: week_day_1, week_day_2, week_day_3, week_day_4, week_day_5, week_day_6, week_day_7
INFO:nixtla.nixtla_client:Calling Forecast Endpoint...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Preprocessing dataframes...
INFO:nixtla.nixtla_client:Using the following exogenous variables: week_day_1, week_day_2, week_day_3, week_day_4, week_day_5, week_day_6, week_day_7
INFO:nixtla.nixtla_client:Calling Forecast Endpoint...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Preprocessing dataframes...
INFO:nixtla.nixtla_client:Using the following exogenous variables: week_day_1, week_day_2, week_day_3, week_day_4, week_day_5, week_day_6, week_day_7
INFO:nixtla.nixtla_client:Calling Forecast Endpoint...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Preprocessing dataframes...
INFO:nixtla.nixtla_client:Using the following exogenous variables: week_day_1, week_day_2, week_day_3, week_day_4, week_day_5, week_day_6, week_day_7
INFO:nixtla.nixtla_client:Calling Forecast Endpoint...
INFO:nixtla.nixtla_client:Validating inputs...
unique_id | date | cutoff | users | TimeGPT | TimeGPT-lo-99.5 | TimeGPT-lo-90 | TimeGPT-lo-80 | TimeGPT-hi-80 | TimeGPT-hi-90 | TimeGPT-hi-99.5 | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | daily_visits | 2021-07-01 | 2021-06-30 | 3123 | 3314.773743 | 2793.566942 | 3043.304261 | 3085.668122 | 3543.879364 | 3586.243226 | 3835.980544 |
1 | daily_visits | 2021-07-02 | 2021-06-30 | 2870 | 3093.066529 | 2139.727892 | 2725.964112 | 2779.082154 | 3407.050904 | 3460.168946 | 4046.405166 |
2 | daily_visits | 2021-07-03 | 2021-06-30 | 2020 | 2347.973573 | 1386.090529 | 1915.487550 | 1973.679628 | 2722.267519 | 2780.459596 | 3309.856618 |
3 | daily_visits | 2021-07-04 | 2021-06-30 | 1828 | 2182.467408 | 1003.677454 | 1681.246491 | 1874.572327 | 2490.362488 | 2683.688324 | 3361.257361 |
4 | daily_visits | 2021-07-05 | 2021-06-30 | 2722 | 3083.629453 | 1257.248435 | 2220.430357 | 2556.408628 | 3610.850279 | 3946.828550 | 4910.010472 |
Let’s plot our forecasts again and calculate our error.
nixtla_client.plot(
df,
timegpt_cv_df_with_ex.drop(columns=['cutoff', 'users']),
time_col='date',
target_col='users',
max_insample_length=90,
level=[80, 90, 99.5]
)
mae_timegpt_with_exogenous = mae(df = timegpt_cv_df_with_ex.drop(columns=['cutoff']),
models=['TimeGPT'],
target_col='users')
mae_timegpt_with_exogenous
unique_id | TimeGPT | |
---|---|---|
0 | daily_visits | 167.22857 |
To conclude, we obtain the following forecast results in this notebook:
mae_timegpt['Exogenous features'] = False
mae_timegpt_with_exogenous['Exogenous features'] = True
df_results = pd.concat([mae_timegpt, mae_timegpt_with_exogenous])
df_results = df_results.rename(columns={'TimeGPT':'MAE backtest'})
df_results = df_results.drop(columns={'unique_id'})
df_results['model'] = 'TimeGPT'
df_results[['model', 'Exogenous features', 'MAE backtest']]
model | Exogenous features | MAE backtest | |
---|---|---|---|
0 | TimeGPT | False | 167.691711 |
0 | TimeGPT | True | 167.228570 |
We’ve shown how to forecast daily visits of a website. We achieved almost 10% better forecasting results as compared to the original tutorial, using significantly less lines of code, in a fraction of the time required to run everything.
Did you notice how little effort that took? What you did not have to do, is:
- Elaborate data preprocessing - just a table with timeseries is sufficient
- Creating a validation- and test set - TimeGPT handles the cross-validation in a single function
- Choosing and testing different models - It’s just a single call to TimeGPT
- Hyperparameter tuning - Not necessary.
Happy forecasting!
Updated about 2 months ago