Toto - Time Series Optimized Transformer for Observability

Paper | Toto Model Card | BOOM Dataset Card | Blogpost

Toto is a foundation model for multivariate time series forecasting with a focus on observability metrics. Its architecture is designed to efficiently handle the high-dimensional, complex time series that are characteristic of observability data.

This repository also hosts the code for evaluating time series models on BOOM (Benchmark of Observability Metrics), a large-scale forecasting dataset composed of real-world observability data.

Updates

🎉🎉 [Feb 2025] Fine-tuning Support: You can now fine-tune Toto on your own datasets! Includes a ready-to-use training script, example configs, and a hands-on tutorial notebook to get you started.
📈 [Feb 2025] Exogenous Covariate Support: Toto now supports known future exogenous covariates (e.g., weather forecasts, scheduled events) during both fine-tuning and inference to improve forecasting accuracy.

Toto model

Features

Zero-Shot Forecasting: Perform forecasting without fine-tuning on your specific time series
State-of-the-Art Performance: Achieves top scores in benchmarks covering diverse time series forecasting tasks. This includes the established multi-domain benchmark GIFT-Eval, as well as our own observability-focused benchmark BOOM.
Multi-Variate Support: Efficiently process multiple variables using Proportional Factorized Space-Time Attention
Probabilistic Predictions: Generate both point forecasts and uncertainty estimates using a Student-T mixture model
High-Dimensional Support: Handle time series with a large number of variables efficiently
Decoder-Only Architecture: Support for variable prediction horizons and context lengths
Pre-trained on Massive Data: Trained on over 2 trillion time series data points, the largest pretraining dataset for any open-weights time series foundation model to date.

Model Weights

Toto-Open, the open-weights release of Toto, is available on Hugging Face. Currently available checkpoints:

Checkpoint	Parameters	Notes
Toto-Open-Base-1.0	151M	The initial open release of Toto. Achieves state-of-the-art performance on both general-purpose and observability-focused benchmarking tasks, as described in our paper.

Installation

# Optional: create a virtual environment
python -m venv .venv
source .venv/bin/activate

# Install via pip
pip install toto-ts

Or, from a local clone of this repository, install as an editable package (recommended for development or fine-tuning):

cd Toto
pip install -r requirements.txt
pip install -e .

For optimal inference speed, it's recommended to install xformers and flash-attention as well.

Quick Start

Here's a simple example to get you started with forecasting:

⚠️ In our study, we take the median across 256 samples to produce a point forecast. This tutorial previously used the mean but has now been updated.

import torch
from toto.data.util.dataset import MaskedTimeseries
from toto.inference.forecaster import TotoForecaster
from toto.model.toto import Toto

# Load the pre-trained model
toto = Toto.from_pretrained('Datadog/Toto-Open-Base-1.0')
toto.to('cuda')  # Move to GPU

# Optionally compile the model for faster inference
toto.compile()  # Uses Torch's JIT compilation for better performance

forecaster = TotoForecaster(toto.model)

# Prepare your input time series (channels, time_steps)
input_series = torch.randn(7, 4096).to('cuda')  # Example with 7 variables and 4096 timesteps

# Prepare timestamp information (optional, but expected by API; not used by the current model release)
timestamp_seconds = torch.zeros(7, 4096).to('cuda')
time_interval_seconds = torch.full((7,), 60*15).to('cuda')  # 15-minute intervals

# Create a MaskedTimeseries object
inputs = MaskedTimeseries(
    series=input_series,
    padding_mask=torch.full_like(input_series, True, dtype=torch.bool),
    id_mask=torch.zeros_like(input_series),
    timestamp_seconds=timestamp_seconds,
    time_interval_seconds=time_interval_seconds,
)

# Generate forecasts for the next 336 timesteps
forecast = forecaster.forecast(
    inputs,
    prediction_length=336,
    num_samples=256,  # Number of samples for probabilistic forecasting
    samples_per_batch=256,  # Control memory usage during inference
)

# Access results
median_prediction = forecast.median  # Point forecasts
prediction_samples = forecast.samples  # Probabilistic samples
lower_quantile = forecast.quantile(0.1)  # 10th percentile for lower confidence bound
upper_quantile = forecast.quantile(0.9)  # 90th percentile for upper confidence bound
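
The median and quantile values above are summaries taken over the sample dimension of the forecast. The following standalone sketch uses plain Python lists instead of Toto's forecast object to illustrate why a median point forecast is more robust than a mean when some samples are extreme. The sample values and the helper names `point_forecast` and `empirical_quantile` are illustrative assumptions and are not part of the Toto API.

```python
import statistics


def point_forecast(samples_per_step, reducer=statistics.median):
    """Collapse samples into one value per timestep.

    samples_per_step: list of lists, one inner list of samples per timestep.
    """
    return [reducer(step_samples) for step_samples in samples_per_step]


def empirical_quantile(step_samples, q):
    """Linear-interpolation quantile of one timestep's samples (0 <= q <= 1)."""
    ordered = sorted(step_samples)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    weight = position - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


# Hypothetical samples for two forecast timesteps. The second timestep
# contains one extreme draw, as can happen with heavy-tailed distributions.
samples = [
    [10.0, 11.0, 9.0, 10.5, 9.5],
    [10.0, 11.0, 9.0, 10.5, 500.0],
]

median_forecast = point_forecast(samples)                  # barely moved by the outlier
mean_forecast = point_forecast(samples, statistics.fmean)  # pulled toward 500.0
lower_bound = [empirical_quantile(s, 0.1) for s in samples]
upper_bound = [empirical_quantile(s, 0.9) for s in samples]
```

In this toy case the median of the second timestep stays close to the bulk of the samples, while the mean is dominated by the single extreme draw.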

Tutorials

For a comprehensive guide on using Toto for time series forecasting, check out our tutorial notebooks:

Basic Inference Tutorial: Learn how to load the model and make forecasts
Fine-tuning Tutorial: Learn how to fine-tune Toto on custom datasets with or without exogenous covariates

Pre-Training Data

Toto was trained on a massive and diverse mixture of time series datasets:

Observability Data

The largest portion of pretraining data comes from a dataset of approximately 1 trillion time series points collected from Datadog metrics. These metrics are generated from Datadog's monitoring of internal systems, and do not include any customer data. They cover a diverse array of software stacks and types of services, and span a wide variety of domains within observability, including application performance, infrastructure, networking, security, databases, and more.

Public Datasets

To improve the performance of Toto on general-purpose time series forecasting across many domains, we include publicly available datasets:

GiftEval Pretrain
Chronos pretraining data (Note: only a subset of this dataset was used to avoid leakage with the GiftEval benchmark)

Synthetic Data

To improve robustness, approximately 1/3 of the pretraining data mix consists of synthetically generated time series.

Evaluation

Toto has been rigorously evaluated on multiple benchmarks, including both general-purpose datasets and observability-focused datasets like BOOM. Below, we provide instructions for reproducing our evaluation results.

LSF Evaluation

To reproduce our results on the LSF datasets, follow these steps:

Downloading the Datasets

The LSF evaluation requires three datasets: ETT, Electricity, and Weather. You can download them from the Time-Series-Library repository. Follow the instructions in that repository to obtain the following pre-processed datasets:

ETT (Electricity Transformer Temperature): Includes four subsets: ETTh1, ETTh2, ETTm1, and ETTm2.
Electricity
Weather

After downloading, ensure the datasets are placed in the data/lsf_datasets/ directory within the repository, with the following structure:

data/
└── lsf_datasets/
  ├── ETT-small/
  ├── electricity/
  └── weather/
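
Before launching the evaluation, it can help to confirm that the layout above is in place. The sketch below is a small convenience check and is not part of the repository; the function name `missing_lsf_dirs` is hypothetical, and the directory names are copied from the tree shown above.

```python
from pathlib import Path

# Directory names are taken from the layout shown above.
EXPECTED_SUBDIRS = ("ETT-small", "electricity", "weather")


def missing_lsf_dirs(repo_root="."):
    """Return the expected dataset directories that are absent under repo_root."""
    base = Path(repo_root) / "data" / "lsf_datasets"
    return [name for name in EXPECTED_SUBDIRS if not (base / name).is_dir()]


if __name__ == "__main__":
    missing = missing_lsf_dirs()
    if missing:
        print("Missing dataset directories:", ", ".join(missing))
    else:
        print("All LSF dataset directories found.")
```

Run it from the repository root, or pass the path to the repository as `repo_root`.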

Running the Evaluation Script

Once the datasets are set up, you can run the LSF evaluation script as follows to reproduce our results:

export CUBLAS_WORKSPACE_CONFIG=:4096:8  # For reproducible GPU results
export PYTHONPATH="$(pwd):$(pwd)/toto:$PYTHONPATH"  # Add current and "toto" dirs to Python module search path
python toto/evaluation/run_lsf_eval.py \
    --datasets ETTh1 \
    --context-length 2048 \
    --eval-stride 1 \
    --checkpoint-path [CHECKPOINT-NAME-OR-DIR]

To see all available options for the evaluation script, you can use the --help flag:

python toto/evaluation/run_lsf_eval.py --help

Expected Results

The script evaluates Toto's performance using Mean Absolute Error (MAE) and Mean Squared Error (MSE) across the specified datasets, context lengths, and prediction lengths. It displays a detailed table of results for each prediction length, along with a summary table that averages the results across prediction lengths for each dataset.
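
For reference, the two metrics can be written out in a few lines. This is a minimal sketch of the standard definitions of MAE and MSE on plain Python lists, with made-up values; it is not the evaluation script's implementation, which may differ in details such as normalization or how values are averaged across variables.

```python
def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over all points."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_squared_error(actual, predicted):
    """Average of (actual - predicted)^2 over all points."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical ground truth and point forecast for four timesteps.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.0, 6.0]

mae = mean_absolute_error(actual, predicted)  # errors: 0.5, 0, 1, 2
mse = mean_squared_error(actual, predicted)   # squared: 0.25, 0, 1, 4
```

Because MSE squares each error, the single miss of 2.0 on the last timestep contributes far more to MSE than to MAE.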

To reproduce the results presented in the paper, use the default arguments while setting --eval-stride 1 and specifying all datasets with --datasets ETTh1 ETTh2 ETTm1 ETTm2 weather electricity.
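
To give a sense of what the stride setting controls, the sketch below enumerates sliding evaluation windows over a single series. It is a generic illustration under stated assumptions: the function `window_starts` is hypothetical, the series length is made up, and the evaluation script may construct its windows differently (for example, by restricting them to a test split).

```python
def window_starts(series_length, context_length, prediction_length, stride=1):
    """Start indices of evaluation windows that fit entirely within the series.

    Each window uses `context_length` points as model input and the
    following `prediction_length` points as the forecast target.
    """
    if min(series_length, context_length, prediction_length, stride) <= 0:
        raise ValueError("all arguments must be positive")
    window_length = context_length + prediction_length
    last_start = series_length - window_length
    # range() is empty when the series is shorter than one full window.
    return list(range(0, last_start + 1, stride))


# Hypothetical series of 3000 points with a 2048-point context and a
# 336-step horizon, evaluated densely (stride 1) and sparsely (stride 336).
dense = window_starts(3000, context_length=2048, prediction_length=336, stride=1)
sparse = window_starts(3000, context_length=2048, prediction_length=336, stride=336)
```

A stride of 1 evaluates every possible window, which is the most thorough and the most expensive setting; larger strides trade coverage for speed.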

GIFT-Eval Evaluation

To reproduce our results on the GIFT-Eval benchmark, we provide a dedicated notebook:

GIFT-Eval Evaluation Notebook: Step-by-step instructions for running Toto on the GIFT-Eval benchmark and reproducing the reported results.

BOOM Evaluation

For evaluating Toto on the BOOM (Benchmark of Observability Metrics) dataset, refer to:

BOOM Evaluation Notebook: Example workflow for running Toto on the BOOM dataset.
BOOM README: Detailed instructions and scripts for benchmarking on BOOM.

These resources provide all necessary steps to run and reproduce BOOM evaluation results with Toto.

🆕 Fine-tuning

Toto can be fine-tuned on your own domain-specific datasets to improve performance on specialized forecasting tasks. The fine-tuning pipeline supports both standard time series and datasets with exogenous (known future) variables.

Fine-tuning Tutorial

To fine-tune Toto, use the provided fine-tuning tutorial, which demonstrates fine-tuning with and without exogenous variables.

To customize the fine-tuning recipe, modify the base configuration in finetune_config.yaml.

By default, the tutorial uses the proenfo_gfc12 dataset from the autogluon/fev_datasets collection.

Custom Datasets

There are two ways to use custom datasets for fine-tuning:

Option A: HuggingFace Dataset with Configuration Dictionary

The simplest approach is to use a HuggingFace datasets.Dataset configured via a dictionary. Modify the prepare_dataset() function in benchmark_finetuning.py to load your data:

custom_dataset = {
    "dataset": dataset,              # HuggingFace Dataset object
    "target_fields": ["target"],     # List of field names for target variables
    "target_transform_fns": [...],   # Transform functions for each target field
    "ev_fields": ["temp", "humidity"],  # List of exogenous covariate field names
    "ev_transform_fns": [...],       # Transform functions for each exogenous field
    "dataset_name": "my_dataset",    # Name of your custom dataset
}

HuggingFace Dataset Requirements:

Your dataset must contain:

timestamp: A 1D array of timestamps for each time series
Target fields (e.g., target): Arrays of shape (T,) for each target variable
Exogenous fields (optional): Arrays of shape (T,) for each dynamic exogenous variable

The pipeline uses FinetuneDataModule, which internally converts your data into CausalMaskedTimeseries objects (the input format expected by Toto during fine-tuning) via GluonTS transforms.
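
A quick way to catch shape problems before training is to validate each record against the requirements above. The sketch below works on plain dictionaries and does not use the `datasets` library or any Toto code; the helper `validate_record` is hypothetical, and it assumes that `timestamp` has the same length T as the target and exogenous fields. The field names `target`, `temp`, and `humidity` mirror the example configuration.

```python
def validate_record(record, target_fields, ev_fields=()):
    """Check that one time series record has aligned 1D fields.

    Returns the common length T, or raises ValueError describing the problem.
    """
    if "timestamp" not in record:
        raise ValueError("record is missing the 'timestamp' field")
    expected_length = len(record["timestamp"])
    for field in (*target_fields, *ev_fields):
        if field not in record:
            raise ValueError(f"record is missing field '{field}'")
        if len(record[field]) != expected_length:
            raise ValueError(
                f"field '{field}' has length {len(record[field])}, "
                f"expected {expected_length}"
            )
    return expected_length


# Hypothetical record with one target and two exogenous covariates (T = 4).
record = {
    "timestamp": [0, 900, 1800, 2700],
    "target": [1.0, 1.2, 0.9, 1.1],
    "temp": [20.5, 20.7, 21.0, 21.2],
    "humidity": [0.40, 0.41, 0.43, 0.42],
}

length = validate_record(record, ["target"], ["temp", "humidity"])
```

Running a check like this over every row of the dataset before building the configuration dictionary surfaces missing or misaligned fields early.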

Option B: Custom PyTorch Dataset and DataModule

For full control over data loading, you can implement your own PyTorch Dataset that returns CausalMaskedTimeseries objects and wrap it in a custom LightningDataModule.

Step 1: Create a Dataset class that returns CausalMaskedTimeseries:

from torch.utils.data import Dataset
from toto.data.util.dataset import CausalMaskedTimeseries

class MyCustomDataset(Dataset):
    ...
    def __getitem__(self, idx: int) -> CausalMaskedTimeseries:
        # Build and return a CausalMaskedTimeseries for this sample
        # See toto/data/datasets/gluonts_dataset.py for a reference implementation
        ...

Step 2: Create a custom LightningDataModule:

from lightning import LightningDataModule
from torch.utils.data import DataLoader
from toto.data.util.helpers import collate_causal

class MyFinetuneDataModule(LightningDataModule):
    def __init__(self, train_dataset: MyCustomDataset, val_dataset: MyCustomDataset, ...):
        ...

    def train_dataloader(self) -> DataLoader:
        return DataLoader(self.train_dataset, collate_fn=collate_causal, ...)  # collate_fn is required

    def val_dataloader(self) -> DataLoader:
        return DataLoader(self.val_dataset, collate_fn=collate_causal, ...)

Step 3: Modify finetune_toto.py to use your custom DataModule:

# Replace the get_datamodule() call with your custom DataModule
dm = MyFinetuneDataModule(train_dataset, val_dataset, batch_size=16)
_ = train(module, dm, config)

Evaluations on FEV Datasets

The benchmark_finetuning.py script evaluates Toto on a subset of FEV datasets that are not included in Toto’s pretraining corpus. These datasets contain known exogenous variables, enabling a comparison of three approaches:

Zero-shot Toto — No fine-tuning
Fine-tuned Toto — Fine-tuned without exogenous variables
Fine-tuned Toto with Exogenous Variables — Fine-tuned with known future covariates

Models are evaluated using sliding windows on the test set (10% of each dataset), with context length and horizon configured per FEV task. Results are aggregated using the geometric mean across datasets in aggregate_results.ipynb:

Model	MAE	WQL	MASE
Toto (zero-shot)	6150.242	0.111	0.632
Toto (fine-tuned)	5397.929	0.100	0.574
Toto (fine-tuned + exogenous)	5117.002	0.096	0.535
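
The sketch below illustrates the two calculations involved in reading this table: aggregating per-dataset scores with a geometric mean, and expressing the change between rows as a relative improvement. The per-dataset values are hypothetical and are not results from this benchmark; the MASE values are copied from the table above. The function names are illustrative, and the exact aggregation code lives in aggregate_results.ipynb.

```python
import math


def geometric_mean(values):
    """Geometric mean of strictly positive values."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("values must be non-empty and strictly positive")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def relative_improvement(baseline, candidate):
    """Fractional reduction in error relative to the baseline (positive is better)."""
    return (baseline - candidate) / baseline


# Hypothetical per-dataset MASE scores for a single model (not real results).
per_dataset_mase = [0.5, 0.8, 1.25]
aggregate = geometric_mean(per_dataset_mase)

# Aggregate MASE values copied from the table above.
mase = {
    "zero-shot": 0.632,
    "fine-tuned": 0.574,
    "fine-tuned + exogenous": 0.535,
}
gains = {
    name: relative_improvement(mase["zero-shot"], value)
    for name, value in mase.items()
    if name != "zero-shot"
}
```

With the table's MASE values, fine-tuning corresponds to roughly a 9% reduction relative to zero-shot, and fine-tuning with exogenous variables to roughly 15%. A geometric mean is a common choice for this kind of aggregation because it is less dominated by datasets whose errors happen to be on a larger scale.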

Requirements

Python 3.10+
PyTorch 2.5+
CUDA-capable device (Ampere generation or newer recommended for optimal performance)
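
The following sketch checks a local environment against the requirements listed above. It is a convenience script and not part of the repository; the helper names are hypothetical. It relies on the fact that Ampere GPUs report CUDA compute capability 8.x, and it skips the GPU checks if PyTorch is not installed.

```python
import sys


def parse_version(text):
    """Extract (major, minor) from a version string such as '2.5.1+cu121'."""
    major, minor = text.split("+")[0].split(".")[:2]
    return int(major), int(minor)


def meets_minimum(found, required):
    """Compare (major, minor) tuples."""
    return tuple(found) >= tuple(required)


def check_environment():
    """Return a dict of requirement name -> whether it is satisfied."""
    report = {"python": meets_minimum(sys.version_info[:2], (3, 10))}
    try:
        import torch
    except ImportError:
        report["torch"] = False
        return report
    report["torch"] = meets_minimum(parse_version(torch.__version__), (2, 5))
    report["cuda"] = torch.cuda.is_available()
    if report["cuda"]:
        # Compute capability major version 8 corresponds to Ampere.
        report["ampere_or_newer"] = torch.cuda.get_device_capability(0)[0] >= 8
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'not satisfied'}")
```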

BOOM (Benchmark of Observability Metrics)

BOOM (Benchmark of Observability Metrics) is a large-scale, real-world time series dataset designed for evaluating models on forecasting tasks in complex observability environments. Composed of real-world metrics data collected from Datadog, a leading observability platform, the benchmark captures the irregularity, structural complexity, and heavy-tailed statistics typical of production observability data. Unlike synthetic or curated benchmarks, BOOM reflects the full diversity and unpredictability of operational signals observed in distributed systems, covering infrastructure, networking, databases, security, and application-level metrics.

Note: the metrics comprising BOOM were generated from internal monitoring of pre-production environments, and do not include any customer data.

For more information on the dataset, including details on its preparation and statistical properties, see the dataset card on Hugging Face.

For example evaluations of different time series models on the BOOM dataset, see the boom folder in this repository.

Citation

If you use Toto in your research, please cite our work:

@misc{cohen2025timedifferentobservabilityperspective,
      title={This Time is Different: An Observability Perspective on Time Series Foundation Models}, 
      author={Ben Cohen and Emaad Khwaja and Youssef Doubli and Salahidine Lemaachi and Chris Lettieri and Charles Masson and Hugo Miccinilli and Elise Ramé and Qiqi Ren and Afshin Rostamizadeh and Jean Ogier du Terrail and Anna-Monica Toon and Kan Wang and Stephan Xie and Zongzhe Xu and Viktoriya Zhukova and David Asker and Ameet Talwalkar and Othmane Abou-Amal},
      year={2025},
      eprint={2505.14766},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2505.14766}, 
}

License

Unless explicitly stated otherwise, all files in this repository are licensed under the Apache-2.0 License. See the LICENSE file for details.

Contributing

We welcome contributions! Please check out our contributing guidelines to get started.
