Run TimeGPT in a distributed way on top of Spark

Spark is an open-source distributed computing framework designed for large-scale data processing. In this guide, we will explain how to use TimeGPT on top of Spark.


  1. Installation

  2. Load Your Data

  3. Initialize Spark

  4. Use TimeGPT on Spark

  5. Stop Spark

1. Installation

Install Spark through Fugue. Fugue provides an easy-to-use interface for distributed computing that lets users execute Python code on top of several distributed computing frameworks, including Spark.


You can install Fugue with Spark support using pip:

pip install fugue[spark]

If executing on a distributed Spark cluster, ensure that the nixtla library is installed across all the workers.
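How dependencies reach the workers depends on your cluster manager; as a rough sketch (adapt to your setup), each worker needs:

```shell
# Run on every worker node (or via your cluster manager's
# dependency-management mechanism, e.g. a cluster-wide environment):
pip install nixtla "fugue[spark]"
```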

2. Load Your Data

You can load your data as a pandas DataFrame. In this tutorial, we will use a dataset that contains hourly electricity prices from different markets.

import pandas as pd

df = pd.read_csv('')
df.head()

   unique_id                   ds      y
0         BE  2016-10-22 00:00:00  70.00
1         BE  2016-10-22 01:00:00  37.10
2         BE  2016-10-22 02:00:00  37.10
3         BE  2016-10-22 03:00:00  44.75
4         BE  2016-10-22 04:00:00  37.10
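If the CSV is not at hand, an equivalent frame can be built directly in the long format shown above. The column names `unique_id`, `ds`, and `y` are the standard Nixtla schema, inferred here from the preview:

```python
import pandas as pd

# Rebuild the preview as a long-format DataFrame:
# one row per (series, timestamp), columns unique_id / ds / y.
df = pd.DataFrame({
    "unique_id": ["BE"] * 5,
    "ds": pd.date_range("2016-10-22", periods=5, freq="h").strftime("%Y-%m-%d %H:%M:%S"),
    "y": [70.00, 37.10, 37.10, 44.75, 37.10],
})
print(df)
```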

3. Initialize Spark

Initialize Spark and convert the pandas DataFrame to a Spark DataFrame.

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
spark_df = spark.createDataFrame(df)

4. Use TimeGPT on Spark

Using TimeGPT on top of Spark is almost identical to the non-distributed case. The only difference is that you need to use a Spark DataFrame.

First, instantiate the NixtlaClient class.

from nixtla import NixtlaClient
nixtla_client = NixtlaClient(
    # defaults to os.environ.get("NIXTLA_API_KEY")
    api_key='my_api_key_provided_by_nixtla'
)
Then use any method from the NixtlaClient class such as forecast or cross_validation.

# Forecast the next 12 steps for each series
fcst_df = nixtla_client.forecast(spark_df, h=12)
# Evaluate with 5 rolling validation windows, spaced 2 steps apart
cv_df = nixtla_client.cross_validation(spark_df, h=12, n_windows=5, step_size=2)
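To see how the cross-validation parameters interact, here is a small arithmetic sketch of the usual rolling-window layout (plain Python, no Spark required; the series length of 100 is hypothetical). Each of the 5 windows forecasts h=12 steps, consecutive cutoffs are step_size=2 apart, and the last window ends at the end of the series:

```python
h, n_windows, step_size = 12, 5, 2
series_length = 100  # hypothetical number of observations in one series

# A cutoff is the index of the last training observation for a window.
# The latest window ends exactly at the series end, and each earlier
# window's cutoff moves back by step_size.
cutoffs = [series_length - h - i * step_size for i in range(n_windows)][::-1]
print(cutoffs)  # → [80, 82, 84, 86, 88]
```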

You can also use exogenous variables with TimeGPT on top of Spark. To do this, please refer to the Exogenous Variables tutorial, keeping in mind that you need to use a Spark DataFrame instead of a pandas DataFrame.

5. Stop Spark

When you are done, stop the Spark session.