Spark
Run TimeGPT in a distributed manner on top of Spark
Spark is an open-source distributed computing framework designed for large-scale data processing. This guide explains how to use TimeGPT on top of Spark.
Outline:
1. Installation
Install Spark through Fugue. Fugue provides an easy-to-use interface for distributed computing that lets users execute Python code on top of several distributed computing frameworks, including Spark.
Note
You can install fugue with pip:
pip install fugue[spark]
If executing on a distributed Spark cluster, ensure that the nixtla library is installed across all the workers.
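One way to check the requirement above is to test whether the packages can be imported in a given Python environment. The sketch below uses only the standard library; `missing_packages` is a hypothetical helper name, not part of nixtla, fugue, or Spark. Run on the driver, it only inspects the driver's environment; checking the workers would require running it inside a Spark task, which is not shown here.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be imported here."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Packages this guide relies on; an empty list means all are importable.
    print(missing_packages(["nixtla", "fugue", "pyspark"]))
```

An empty list means every package named in this guide can be imported in the environment where the check ran.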
2. Load Data
You can load your data as a pandas DataFrame. In this tutorial, we will use a dataset that contains hourly electricity prices from different markets.
import pandas as pd
df = pd.read_csv(
    'https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short.csv',
    parse_dates=['ds'],
)
df.head()
| | unique_id | ds | y |
|---|---|---|---|
| 0 | BE | 2016-10-22 00:00:00 | 70.00 |
| 1 | BE | 2016-10-22 01:00:00 | 37.10 |
| 2 | BE | 2016-10-22 02:00:00 | 37.10 |
| 3 | BE | 2016-10-22 03:00:00 | 44.75 |
| 4 | BE | 2016-10-22 04:00:00 | 37.10 |
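The table above is in long format: one row per series identifier (`unique_id`) and timestamp (`ds`), with the observed value in `y`. The later calls in this guide pass no column-name arguments, so they rely on these names being present. A minimal sketch of a pre-flight check follows; `missing_columns` and `REQUIRED_COLUMNS` are hypothetical names introduced here for illustration, with the column names taken from the table.

```python
REQUIRED_COLUMNS = ("unique_id", "ds", "y")


def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in required if name not in present]


# Works with any iterable of column names, e.g. `df.columns`.
print(missing_columns(["unique_id", "ds", "y"]))    # → []
print(missing_columns(["unique_id", "timestamp"]))  # → ['ds', 'y']
```

Because the helper only looks at names, the same check can be applied to the column list of the pandas DataFrame or of the Spark DataFrame created in the next step.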
3. Initialize Spark
Initialize Spark and convert the pandas DataFrame to a Spark DataFrame.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
spark_df = spark.createDataFrame(df)
spark_df.show(5)
4. Use TimeGPT on Spark
Using TimeGPT on top of Spark is almost identical to the non-distributed case. The only difference is that you need to use a Spark DataFrame.
First, instantiate the NixtlaClient class.
from nixtla import NixtlaClient
nixtla_client = NixtlaClient(
    # defaults to os.environ.get("NIXTLA_API_KEY")
    api_key='my_api_key_provided_by_nixtla'
)
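The comment in the snippet above notes that the API key defaults to the `NIXTLA_API_KEY` environment variable. The sketch below illustrates that kind of fallback with a clear error when no key is found. `resolve_api_key` is a hypothetical helper written for this guide; it does not reproduce the library's internal logic.

```python
import os


def resolve_api_key(explicit=None, env=None):
    """Pick an API key: explicit argument first, then NIXTLA_API_KEY."""
    env = os.environ if env is None else env
    key = explicit or env.get("NIXTLA_API_KEY")
    if not key:
        raise ValueError("Set NIXTLA_API_KEY or pass an API key explicitly.")
    return key


# Example with a stand-in environment mapping instead of the real one.
print(resolve_api_key(env={"NIXTLA_API_KEY": "key-from-env"}))  # → key-from-env
```

The result could then be passed as `NixtlaClient(api_key=resolve_api_key())`, which keeps the key out of the source code.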
Use an Azure AI endpoint
To use an Azure AI endpoint, set the base_url argument:
nixtla_client = NixtlaClient(base_url="your azure ai endpoint", api_key="your api_key")
Then use any method from the NixtlaClient class, such as forecast or cross_validation.
fcst_df = nixtla_client.forecast(spark_df, h=12)
fcst_df.show(5)
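In the call above, `h=12` is the forecast horizon: with hourly data, that means the 12 hours following the last observation of each series. The sketch below only illustrates which timestamps such a horizon covers; it is not how TimeGPT builds its output. `future_timestamps` is a hypothetical helper, and the last observed timestamp used in the example is made up.

```python
from datetime import datetime, timedelta


def future_timestamps(last_observed, h, step=timedelta(hours=1)):
    """Return the h timestamps that follow `last_observed` at a fixed step."""
    return [last_observed + step * (i + 1) for i in range(h)]


# Hypothetical last observation at 04:00; the horizon covers 05:00 to 16:00.
stamps = future_timestamps(datetime(2016, 10, 22, 4), h=12)
print(stamps[0], stamps[-1])
# → 2016-10-22 05:00:00 2016-10-22 16:00:00
```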
Available models in Azure AI
If you are using an Azure AI endpoint, be sure to set model="azureai":
nixtla_client.forecast(..., model="azureai")
For the public API, we support two models: timegpt-1 and timegpt-1-long-horizon. By default, timegpt-1 is used. Please see this tutorial on how and when to use timegpt-1-long-horizon.
cv_df = nixtla_client.cross_validation(spark_df, h=12, n_windows=5, step_size=2)
cv_df.show(5)
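The cross-validation call above asks for 5 windows of 12 steps each, with consecutive windows 2 steps apart. The sketch below shows how such windows are laid out under the common rolling-origin convention, where the last window ends at the final observation. This placement is an assumption; the guide does not describe the service's internals. `rolling_windows` is a hypothetical helper, and the series length of 100 is made up.

```python
def rolling_windows(n_obs, h, n_windows, step_size):
    """Return (train_end, test_end) index pairs, earliest window first.

    Each window trains on observations [0, train_end) and is evaluated
    on the next h observations, [train_end, test_end).
    """
    needed = h + (n_windows - 1) * step_size
    if n_obs <= needed:
        raise ValueError("Series is too short for the requested windows.")
    windows = []
    for i in range(n_windows):
        # The last window ends at the final observation; earlier windows
        # are shifted back by step_size observations each.
        test_end = n_obs - (n_windows - 1 - i) * step_size
        windows.append((test_end - h, test_end))
    return windows


# Same settings as the call above, for a hypothetical series of 100 points.
print(rolling_windows(n_obs=100, h=12, n_windows=5, step_size=2))
# → [(80, 92), (82, 94), (84, 96), (86, 98), (88, 100)]
```

Because `step_size` is smaller than `h` here, the evaluation ranges of consecutive windows overlap.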
You can also use exogenous variables with TimeGPT on top of Spark. To do this, please refer to the Exogenous Variables tutorial. Just keep in mind that you need to use a Spark DataFrame instead of a pandas DataFrame.
5. Stop Spark
When you are done, stop the Spark session.
spark.stop()
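If an earlier step raises an exception, a plain `spark.stop()` at the end of a script is never reached. One way to guard against that is to wrap the work in a context manager that always calls `stop()`. The sketch below is a minimal version; `stop_when_done` and `FakeSession` are hypothetical names, and the fake session exists only so the example runs without Spark installed.

```python
from contextlib import contextmanager


@contextmanager
def stop_when_done(session):
    """Yield `session` and call its stop() method on exit, even on error."""
    try:
        yield session
    finally:
        session.stop()


class FakeSession:
    """Stand-in for a Spark session so the sketch runs without Spark."""

    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True


fake = FakeSession()
with stop_when_done(fake) as session:
    pass  # forecasting calls would go here
print(fake.stopped)  # → True
```

With a real session, the same helper would be used as `with stop_when_done(SparkSession.builder.getOrCreate()) as spark:`.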