Run TimeGPT distributedly on top of Spark

Spark is an open-source distributed computing framework designed for large-scale data processing. In this guide, we will explain how to use TimeGPT on top of Spark.

Outline:

  1. Installation

  2. Load Your Data

  3. Initialize Spark

  4. Use TimeGPT on Spark

  5. Stop Spark

1. Installation

Install Spark through Fugue. Fugue provides an easy-to-use interface for distributed computing that lets users execute Python code on top of several distributed computing frameworks, including Spark.

Note

You can install fugue with pip:

pip install fugue[spark]

If executing on a distributed Spark cluster, ensure that the nixtla library is installed across all the workers.

2. Load Data

You can load your data as a pandas DataFrame. In this tutorial, we will use a dataset that contains hourly electricity prices from different markets.

import pandas as pd 
df = pd.read_csv('https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short.csv') 
df.head()
unique_iddsy
0BE2016-10-22 00:00:0070.00
1BE2016-10-22 01:00:0037.10
2BE2016-10-22 02:00:0037.10
3BE2016-10-22 03:00:0044.75
4BE2016-10-22 04:00:0037.10

3. Initialize Spark

Initialize Spark and convert the pandas DataFrame to a Spark DataFrame.

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
spark_df = spark.createDataFrame(df)
spark_df.show(5)

4. Use TimeGPT on Spark

Using TimeGPT on top of Spark is almost identical to the non-distributed case. The only difference is that you need to use a Spark DataFrame.

First, instantiate the NixtlaClient class.

from nixtla import NixtlaClient
nixtla_client = NixtlaClient(
    # defaults to os.environ.get("NIXTLA_API_KEY")
    api_key = 'my_api_key_provided_by_nixtla'
)

Then use any method from the NixtlaClient class such as forecast or cross_validation.

fcst_df = nixtla_client.forecast(spark_df, h=12)
fcst_df.show(5)
cv_df = nixtla_client.cross_validation(spark_df, h=12, n_windows=5, step_size=2)
cv_df.show(5)

You can also use exogenous variables with TimeGPT on top of Spark. To do this, please refer to the Exogenous Variables tutorial. Just keep in mind that instead of using a pandas DataFrame, you need to use a Spark DataFrame instead.

5. Stop Spark

When you are done, stop the Spark session.

spark.stop()