Spark

Run TimeGPT in a distributed manner on top of Spark

Spark is an open-source distributed computing framework designed for large-scale data processing. In this guide, we will explain how to use TimeGPT on top of Spark.

Outline:

Installation
Load Data
Initialize Spark
Use TimeGPT on Spark
Stop Spark

1. Installation

Install Spark through Fugue. Fugue provides an easy-to-use interface for distributed computing that lets users execute Python code on top of several distributed computing frameworks, including Spark.

Note

You can install fugue with pip:
pip install "fugue[spark]"

If executing on a distributed Spark cluster, ensure that the nixtla library is installed across all the workers.
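One way to confirm the requirement above is a quick import check. The sketch below uses only the Python standard library; the helper name `missing_packages` is hypothetical, and running it on the driver says nothing about the workers unless you also execute it there (for example, inside a Spark task).

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level packages from `names` that cannot be imported here."""
    return [name for name in names if find_spec(name) is None]


# Packages this guide relies on; adjust the list to your environment.
required = ["nixtla", "fugue", "pyspark"]
missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```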

2. Load Data

You can load your data as a pandas DataFrame. In this tutorial, we will use a dataset that contains hourly electricity prices from different markets.

import pandas as pd

df = pd.read_csv(
    'https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short.csv',
    parse_dates=['ds'],
)
df.head()

	unique_id	ds	y
0	BE	2016-10-22 00:00:00	70.00
1	BE	2016-10-22 01:00:00	37.10
2	BE	2016-10-22 02:00:00	37.10
3	BE	2016-10-22 03:00:00	44.75
4	BE	2016-10-22 04:00:00	37.10
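Before distributing the data, it can help to confirm that the frame has the columns shown in the preview. The sketch below is a minimal check under stated assumptions: the helper `check_long_format` is hypothetical, and the column names `unique_id`, `ds`, and `y` are taken from the preview above rather than from a stated requirement. It uses the first three preview rows inline so it runs without network access.

```python
import pandas as pd


def check_long_format(frame, required=("unique_id", "ds", "y")):
    """Raise if expected columns are missing; otherwise return rows per series."""
    absent = [col for col in required if col not in frame.columns]
    if absent:
        raise ValueError(f"Missing columns: {absent}")
    # Count observations per series to spot short or empty series early.
    return frame.groupby("unique_id").size().to_dict()


# First three rows of the preview above, typed in by hand.
sample = pd.DataFrame(
    {
        "unique_id": ["BE", "BE", "BE"],
        "ds": pd.to_datetime(
            ["2016-10-22 00:00:00", "2016-10-22 01:00:00", "2016-10-22 02:00:00"]
        ),
        "y": [70.00, 37.10, 37.10],
    }
)
counts = check_long_format(sample)
print(counts)
```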

3. Initialize Spark

Initialize Spark and convert the pandas DataFrame to a Spark DataFrame.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark_df = spark.createDataFrame(df)
spark_df.show(5)

4. Use TimeGPT on Spark

Using TimeGPT on top of Spark is almost identical to the non-distributed case. The only difference is that you need to use a Spark DataFrame.

First, instantiate the NixtlaClient class.

from nixtla import NixtlaClient

nixtla_client = NixtlaClient(
    # defaults to os.environ.get("NIXTLA_API_KEY")
    api_key='my_api_key_provided_by_nixtla',
)
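The comment in the snippet above notes that the key defaults to the `NIXTLA_API_KEY` environment variable. A minimal sketch of failing fast when no key is available, using a hypothetical helper `resolve_api_key` that is not part of the nixtla library:

```python
import os


def resolve_api_key(explicit_key=None, env_var="NIXTLA_API_KEY"):
    """Return the explicit key if given, else the environment value; raise if neither is set."""
    key = explicit_key or os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"No API key passed and {env_var} is not set.")
    return key


# Placeholder value for illustration only; in practice, export the variable in your shell.
os.environ.setdefault("NIXTLA_API_KEY", "my_api_key_provided_by_nixtla")
api_key = resolve_api_key()
```

The resolved value can then be passed as `NixtlaClient(api_key=api_key)`.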

👍
Use an Azure AI endpoint
To use an Azure AI endpoint, set the base_url argument:
nixtla_client = NixtlaClient(base_url="your azure ai endpoint", api_key="your api_key")

Then use any method from the NixtlaClient class such as forecast or cross_validation.

fcst_df = nixtla_client.forecast(spark_df, h=12)
fcst_df.show(5)

📘
Available models in Azure AI
If you are using an Azure AI endpoint, please be sure to set model="azureai":
nixtla_client.forecast(..., model="azureai")
For the public API, we support two models: timegpt-1 and timegpt-1-long-horizon.
By default, timegpt-1 is used. Please see this tutorial on how and when to use timegpt-1-long-horizon.

cv_df = nixtla_client.cross_validation(spark_df, h=12, n_windows=5, step_size=2)
cv_df.show(5)
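To get a feel for what `h=12, n_windows=5, step_size=2` asks for, the sketch below lays out the windows in plain Python. It assumes the common rolling-origin convention in which the last window ends at the final observation and each earlier window starts `step_size` steps before the next. That convention and the helper `cv_windows` are assumptions for illustration, not a description of TimeGPT's internals.

```python
def cv_windows(n_obs, h, n_windows, step_size):
    """Return (train_end, test_end) index pairs, oldest window first.

    train_end is the number of observations used for fitting; the test
    slice covers positions train_end .. test_end - 1.
    """
    needed = h + step_size * (n_windows - 1)
    if n_obs <= needed:
        raise ValueError(f"Need more than {needed} observations, got {n_obs}.")
    windows = []
    for i in range(n_windows):
        # Distance from the end of the series to this window's cutoff.
        offset = h + step_size * (n_windows - 1 - i)
        train_end = n_obs - offset
        windows.append((train_end, train_end + h))
    return windows


for train_end, test_end in cv_windows(n_obs=100, h=12, n_windows=5, step_size=2):
    print(f"train on [0, {train_end}), evaluate on [{train_end}, {test_end})")
```

Under this convention, each series needs more than 12 + 2 × 4 = 20 observations for all five windows to fit.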

You can also use exogenous variables with TimeGPT on top of Spark. To do this, please refer to the Exogenous Variables tutorial. Keep in mind that you need to use a Spark DataFrame instead of a pandas DataFrame.

5. Stop Spark

When you are done, stop the Spark session.

spark.stop()
