OpenT2T is the codebase for the EMNLP 2024 System Demonstrations paper
"OpenT2T: An Open-Source Toolkit for Table-to-Text Generation".
It is a reproducible benchmarking toolkit for table-to-text generation and table-grounded question answering with prompt construction, local inference backends, evaluation, reporting, and run manifests.
This codebase has also been recently updated to stay compatible with newer model families and the current vLLM inference pipeline.
The repository is intentionally dependency-light by default so the full pipeline can run locally against fixture datasets and a deterministic mock backend. Optional integrations are available for vllm and bert-score.
OpenT2T supports:
- dataset preparation and normalization
- prompt construction for zero-shot and few-shot settings
- local deterministic smoke runs through mock backends
- vLLM-based inference for real models
- lexical, semantic, and model-based evaluation
- reporting through CSV, Markdown, and JSON run artifacts
- model registries and static model packs
The main benchmark datasets currently wired into the framework are:
- logicnlg
- totto
- hitabng
- hitabqa
- rotowire
- numericnlg
- scigen
- fetaqa
- qtsumm
OpenT2T requires Python 3.11 or newer.
python3.11 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip setuptools wheel
python -m pip install -e .

After this, the opent2t CLI is available inside the environment.

python3 -m pip install -e .
python3 -m pip install -e .[vllm]
python3 -m pip install -e .[semantic]

Additional metric/runtime dependencies:
- tapas_scores needs torch, transformers, and pandas
- autoacu_scores needs autoacu and its runtime dependencies
If you want a closer-to-full experiment environment rather than the lightweight default install, use:
python3.11 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip setuptools wheel
python -m pip install -e .[semantic]
python -m pip install -e .[vllm]
python -m pip install torch transformers pandas autoacu

Notes:
- install vllm only on a machine with a compatible CUDA and PyTorch stack
- autoacu and tapas_scores are optional; the smoke pipeline does not require them
- for CPU-only local sanity checks, the base install plus mock models is enough
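
Before enabling the optional metrics or the vLLM backend, it can help to see which optional packages are importable in the current environment. The following is a minimal sketch and not part of OpenT2T; the import names (in particular bert_score and autoacu) and the mapping from each module to the feature that needs it are assumptions based on the notes above.

```python
from importlib.util import find_spec

# Assumption: these are the importable module names. PyPI package names
# and module names can differ (bert-score is imported as bert_score).
OPTIONAL_MODULES = {
    "vllm": "vLLM inference backend",
    "bert_score": "semantic metrics",
    "torch": "tapas_scores",
    "transformers": "tapas_scores",
    "pandas": "tapas_scores",
    "autoacu": "autoacu_scores",
}


def missing_optional_modules(modules=OPTIONAL_MODULES):
    """Return the entries of `modules` that cannot be imported here."""
    return {name: use for name, use in modules.items() if find_spec(name) is None}


if __name__ == "__main__":
    for name, use in missing_optional_modules().items():
        print(f"not installed: {name} (needed for {use})")
```

An empty result means every optional module in the table was found; the check does not verify versions or CUDA compatibility.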
Run the built-in smoke benchmark:
python3 -m opent2t.cli benchmark --run-config configs/runs/smoke.json
python3 -m unittest discover -s tests -v

List the active model and pack surface:
python3 -m opent2t.cli list-models
python3 -m opent2t.cli list-packs

opent2t prepare-dataset --dataset totto --split dev

This uses the dataset config in configs/datasets/totto.json and writes a
normalized JSONL file.
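
To sanity-check a normalized file, a generic JSONL reader is enough. This is a sketch, not OpenT2T code: it assumes only that the file holds one JSON value per line, and it makes no assumption about the record schema or about where prepare-dataset writes its output (pass whatever path your config produces).

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Yield one parsed JSON value per non-empty line of `path`."""
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_number}: invalid JSON") from exc


def summarize_jsonl(path):
    """Return (record_count, sorted key names seen across object records)."""
    count = 0
    keys = set()
    for record in read_jsonl(path):
        count += 1
        if isinstance(record, dict):
            keys.update(record)
    return count, sorted(keys)
```

Calling summarize_jsonl on a normalized split gives a quick view of how many examples it contains and which fields they carry.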
opent2t build-prompts \
--dataset totto \
--split dev \
--model mock-chat \
--mode zero_shot \
--run-dir artifacts/runs/totto_dev_demo

opent2t run-inference \
--run-dir artifacts/runs/totto_dev_demo \
--model mock-chat \
--batch-size 4

opent2t evaluate \
--run-dir artifacts/runs/totto_dev_demo \
--metrics bleu rouge_l meteor

opent2t benchmark --run-config configs/runs/smoke.json

Important directories:
- configs/datasets/ contains dataset configs
- configs/models/ contains model specs
- configs/packs/ contains named model packs
- prompts/ contains prompt templates and exemplar pools
- fixtures/raw/ contains local fixture data for smoke tests
- artifacts/runs/ contains run outputs
Dataset configs point at a raw_dir. For local use, you can either:
- place the expected raw or normalized JSONL files in the configured path, or
- edit the dataset config to point at your own local dataset location
The batch scripts can also download processed JSONL splits from a Hugging Face
dataset repo specified through DATASET_REPO.
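
For the option of editing a dataset config, the change can be scripted. The sketch below is hypothetical and not part of the toolkit: it assumes the config is a JSON object with raw_dir as a top-level key, which the description above suggests but does not spell out. Check your config file before relying on it.

```python
import json
from pathlib import Path


def set_raw_dir(config_path, new_raw_dir):
    """Rewrite the raw_dir entry of a JSON dataset config; return the old value."""
    config_path = Path(config_path)
    config = json.loads(config_path.read_text(encoding="utf-8"))
    # Assumption: raw_dir is a top-level key of the config object.
    if "raw_dir" not in config:
        raise KeyError(f"{config_path} has no top-level 'raw_dir' key")
    previous = config["raw_dir"]
    config["raw_dir"] = str(new_raw_dir)
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return previous
```

For example, set_raw_dir("configs/datasets/totto.json", "/path/to/raw/totto") would point the totto config at a local copy of the data, with the target path being a placeholder.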
The checked-in active packs are:
- default
- smoke
- specialist
- mock
You can inspect them with:
opent2t list-packs
opent2t preflight-models --pack default

The repository includes two Slurm entrypoints:
- scripts/model_benchmark_zero_shot.sbatch
- scripts/qwen3_dataset_eval.sbatch
Important:
- The checked-in scripts contain site-specific defaults in their #SBATCH headers.
- Other users should override account, partition, log path, and environment paths on the sbatch command line, or adapt a local copy of the script.
- Do not hardcode tokens in scripts. Export them from your shell and pass them through --export.
export HF_TOKEN=<your_hf_token>
sbatch \
-A <account> \
-p <partition> \
--gres=gpu:1 \
-o /path/to/logs/%x-%j.out \
--export=ALL,HF_TOKEN_VALUE=$HF_TOKEN,REPO_ROOT=/path/to/opent2t,ENV_ROOT=/path/to/env,DATASET_REPO=<hf_user>/<dataset_repo>,RUN_ROOT=/path/to/runs,RAW_ROOT=/path/to/raw \
scripts/model_benchmark_zero_shot.sbatch \
qwen25_7b \
totto \
test

Useful exported overrides:
- BATCH_SIZE
- SAMPLE_SIZE
- RUN_ROOT
- RAW_ROOT
- JOB_CACHE_ROOT
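
The long --export value is easy to mistype by hand. The helper below is a hypothetical sketch, not something shipped in the repository: it reads the token from the HF_TOKEN environment variable (so it is never written into a script) and joins any overrides into the same ALL,NAME=value,... form used in these commands.

```python
import os


def build_export_arg(overrides, token_env="HF_TOKEN"):
    """Build the --export flag for sbatch from a dict of overrides."""
    token = os.environ.get(token_env)
    if not token:
        raise RuntimeError(f"{token_env} is not set in the environment")
    pairs = {"HF_TOKEN_VALUE": token, **overrides}
    for name, value in pairs.items():
        # sbatch separates exported variables with commas, so a comma
        # inside a value would be read as the start of another variable.
        if "," in str(value):
            raise ValueError(f"{name} contains a comma: {value!r}")
    return "--export=ALL," + ",".join(f"{k}={v}" for k, v in pairs.items())
```

The returned string can be passed as a single argument to sbatch, for instance from subprocess.run with an argument list, which avoids shell quoting issues.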
Example with a smaller sample:
sbatch \
-A <account> \
-p <partition> \
--gres=gpu:1 \
-o /path/to/logs/%x-%j.out \
--export=ALL,HF_TOKEN_VALUE=$HF_TOKEN,REPO_ROOT=/path/to/opent2t,ENV_ROOT=/path/to/env,DATASET_REPO=<hf_user>/<dataset_repo>,BATCH_SIZE=8,SAMPLE_SIZE=100 \
scripts/model_benchmark_zero_shot.sbatch \
phi4_mini \
logicnlg \
test

sbatch \
-A <account> \
-p <partition> \
--gres=gpu:1 \
-o /path/to/logs/%x-%j.out \
--export=ALL,HF_TOKEN_VALUE=$HF_TOKEN,REPO_ROOT=/path/to/opent2t,ENV_ROOT=/path/to/env,DATASET_REPO=<hf_user>/<dataset_repo>,RUN_ROOT=/path/to/qwen3_runs,RAW_ROOT=/path/to/raw \
scripts/qwen3_dataset_eval.sbatch \
totto \
test

for dataset in logicnlg totto hitabng hitabqa rotowire numericnlg scigen fetaqa qtsumm; do
sbatch \
-A <account> \
-p <partition> \
--gres=gpu:1 \
-o /path/to/logs/%x-%j.out \
--export=ALL,HF_TOKEN_VALUE=$HF_TOKEN,REPO_ROOT=/path/to/opent2t,ENV_ROOT=/path/to/env,DATASET_REPO=<hf_user>/<dataset_repo> \
scripts/qwen3_dataset_eval.sbatch \
"$dataset" \
test
done

If you want the repo to submit many jobs for you, use:
HF_TOKEN=<your_hf_token> \
PYTHONPATH=src \
python3 scripts/submit_model_benchmark_matrix.py \
--ssh-host <cluster_login_host> \
--repo-root /path/to/opent2t \
--env-root /path/to/env \
--run-root /path/to/runs \
--include-qwen3

Useful flags:
- --plan-only
- --datasets logicnlg totto
- --include-models qwen25_7b phi4_mini
- --exclude-models internlm2_math_plus_20b
Typical Slurm commands:
squeue -u <username>
sacct -j <job_id> --format=JobIDRaw,JobName,State,ExitCode -n -P
tail -f /path/to/logs/<job-name>-<job-id>.out

Typical run outputs:
- Prompt text lives under prompts/.
- Dataset fixtures live under fixtures/raw/.
- Run outputs are written under artifacts/runs/.
- Each completed run writes manifest.json, prompts.jsonl, generations.jsonl, parsed_predictions.jsonl, metrics_per_example.jsonl, metrics_aggregate.json, summary.csv, and summary.md.
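
A quick way to tell whether a run finished is to check for the files listed above. This is a minimal sketch rather than toolkit code; it assumes the artifacts sit directly inside the run directory (for example artifacts/runs/totto_dev_demo) and makes no assumption about the structure inside metrics_aggregate.json.

```python
import json
from pathlib import Path

# File names as listed above for a completed run.
EXPECTED_ARTIFACTS = (
    "manifest.json",
    "prompts.jsonl",
    "generations.jsonl",
    "parsed_predictions.jsonl",
    "metrics_per_example.jsonl",
    "metrics_aggregate.json",
    "summary.csv",
    "summary.md",
)


def missing_artifacts(run_dir):
    """Return the expected artifact names that are absent from `run_dir`."""
    run_dir = Path(run_dir)
    return [name for name in EXPECTED_ARTIFACTS if not (run_dir / name).is_file()]


def load_aggregate_metrics(run_dir):
    """Parse metrics_aggregate.json and return it as-is."""
    path = Path(run_dir) / "metrics_aggregate.json"
    return json.loads(path.read_text(encoding="utf-8"))
```

An empty list from missing_artifacts suggests the run completed every stage; a non-empty list shows which stage's output is absent.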
If you use this repository, please cite:
@inproceedings{zhang-etal-2024-opent2t,
title = "{O}pen{T}2{T}: An Open-Source Toolkit for Table-to-Text Generation",
author = "Zhang, Haowei and
Si, Shengyun and
Zhao, Yilun and
Xie, Lujing and
Xu, Zhijian and
Chen, Lyuhao and
Nan, Linyong and
Wang, Pengcheng and
Tang, Xiangru and
Cohan, Arman",
editor = "Hernandez Farias, Delia Irazu and
Hope, Tom and
Li, Manling",
booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
month = nov,
year = "2024",
address = "Miami, Florida, USA",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.emnlp-demo.27/",
doi = "10.18653/v1/2024.emnlp-demo.27",
pages = "259--269",
abstract = "Table data is pervasive in various industries, and its comprehension and manipulation demand significant time and effort for users seeking to extract relevant information. Consequently, an increasing number of studies have been directed towards table-to-text generation tasks. However, most existing methods are benchmarked solely on a limited number of datasets with varying configurations, leading to a lack of unified, standardized, fair, and comprehensive comparison between methods. This paper presents OpenT2T, the first open-source toolkit for table-to-text generation, designed to reproduce existing large language models (LLMs) for performance comparison and expedite the development of new models. We have implemented and compared a wide range of LLMs under zero- and few-shot settings on 9 table-to-text generation datasets, covering data insight generation, table summarization, and free-form table question answering. Additionally, we maintain a public leaderboard to provide insights for future work into how to choose appropriate table-to-text generation systems for real-world scenarios."
}