OpenT2T

OpenT2T is the codebase for the EMNLP 2024 System Demonstrations paper "OpenT2T: An Open-Source Toolkit for Table-to-Text Generation".

It is a reproducible benchmarking toolkit for table-to-text generation and table-grounded question answering. It covers prompt construction, local inference backends, evaluation, reporting, and run manifests.

The codebase has recently been updated for compatibility with newer model families and the current vLLM inference pipeline.

The repository is intentionally dependency-light by default so the full pipeline can run locally against fixture datasets and a deterministic mock backend. Optional integrations are available for vllm and bert-score.

What the toolkit covers

OpenT2T supports:

dataset preparation and normalization
prompt construction for zero-shot and few-shot settings
local deterministic smoke runs through mock backends
vLLM-based inference for real models
lexical, semantic, and model-based evaluation
reporting through CSV, Markdown, and JSON run artifacts
model registries and static model packs

The main benchmark datasets currently wired into the framework are:

logicnlg
totto
hitabng
hitabqa
rotowire
numericnlg
scigen
fetaqa
qtsumm

Installation

OpenT2T requires Python 3.11 or newer.

Recommended reproducible environment

python3.11 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip setuptools wheel
python -m pip install -e .

After this, the opent2t CLI is available inside the environment.

Minimal local installation

python3 -m pip install -e .

Optional extras

python3 -m pip install -e ".[vllm]"
python3 -m pip install -e ".[semantic]"

Additional metric/runtime dependencies:

tapas_scores needs torch, transformers, and pandas
autoacu_scores needs autoacu and its runtime dependencies

Full environment for model and metric reproduction

If you want a closer-to-full experiment environment rather than the lightweight default install, use:

python3.11 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip setuptools wheel
python -m pip install -e ".[semantic]"
python -m pip install -e ".[vllm]"
python -m pip install torch transformers pandas autoacu

Notes:

install vllm only on a machine with a compatible CUDA and PyTorch stack
autoacu and tapas_scores are optional; the smoke pipeline does not require them
for CPU-only local sanity checks, the base install plus mock models is enough
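
Before launching a real run, it can help to check which optional dependencies are importable in the active environment. The following is a minimal sketch, not part of the toolkit: the import names (for example bert_score for the bert-score package) are assumptions derived from the package names listed above, and missing_optional_modules is a hypothetical helper.

```python
from importlib.util import find_spec

# Import names are assumptions based on the package names mentioned above.
OPTIONAL_MODULES = {
    "vllm": "vLLM inference backend",
    "bert_score": "semantic metrics (bert-score)",
    "torch": "tapas_scores",
    "transformers": "tapas_scores",
    "pandas": "tapas_scores",
    "autoacu": "autoacu_scores",
}


def missing_optional_modules(modules=OPTIONAL_MODULES):
    """Return the optional modules that cannot be found in this environment."""
    return {
        name: purpose
        for name, purpose in modules.items()
        if find_spec(name) is None
    }


if __name__ == "__main__":
    for name, purpose in missing_optional_modules().items():
        print(f"missing: {name} (needed for {purpose})")
```

An empty result means every optional integration is installed; the smoke pipeline should run either way.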

Quick start

Run the built-in smoke benchmark, then the unit tests:

python3 -m opent2t.cli benchmark --run-config configs/runs/smoke.json
python3 -m unittest discover -s tests -v

List the active models and model packs:

python3 -m opent2t.cli list-models
python3 -m opent2t.cli list-packs

Typical CLI workflow

Step 1: prepare a dataset split

opent2t prepare-dataset --dataset totto --split dev

This uses the dataset config in configs/datasets/totto.json and writes a normalized JSONL file.
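
To sanity-check a normalized file after this step, a small script can count its records and list the field names it contains. This is a sketch that makes no assumption about the toolkit's record schema or output location: summarize_jsonl is a hypothetical helper, and the path is whatever your dataset config produces.

```python
import json


def summarize_jsonl(path):
    """Return (record_count, sorted field names) for a JSONL file."""
    count = 0
    keys = set()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            count += 1
            if isinstance(record, dict):
                keys.update(record)
    return count, sorted(keys)
```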

Step 2: build prompts

opent2t build-prompts \
  --dataset totto \
  --split dev \
  --model mock-chat \
  --mode zero_shot \
  --run-dir artifacts/runs/totto_dev_demo

Step 3: run inference

opent2t run-inference \
  --run-dir artifacts/runs/totto_dev_demo \
  --model mock-chat \
  --batch-size 4

Step 4: evaluate the run

opent2t evaluate \
  --run-dir artifacts/runs/totto_dev_demo \
  --metrics bleu rouge_l meteor
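
The four steps above can also be driven from a script. The sketch below only assembles the same command lines shown in this section and prints them by default; workflow_commands and run_workflow are hypothetical helper names, and no flags beyond those shown above are assumed.

```python
import shlex
import subprocess


def workflow_commands(dataset, split, model, run_dir,
                      metrics=("bleu", "rouge_l", "meteor"), batch_size=4):
    """Build the four CLI invocations in the order shown above."""
    return [
        ["opent2t", "prepare-dataset", "--dataset", dataset, "--split", split],
        ["opent2t", "build-prompts", "--dataset", dataset, "--split", split,
         "--model", model, "--mode", "zero_shot", "--run-dir", run_dir],
        ["opent2t", "run-inference", "--run-dir", run_dir, "--model", model,
         "--batch-size", str(batch_size)],
        ["opent2t", "evaluate", "--run-dir", run_dir, "--metrics", *metrics],
    ]


def run_workflow(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops the chain at the first failing step.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_workflow(workflow_commands(
        "totto", "dev", "mock-chat", "artifacts/runs/totto_dev_demo"))
```

Passing dry_run=False executes the steps in order and stops at the first one that exits with a non-zero status.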

One-command benchmark

opent2t benchmark --run-config configs/runs/smoke.json

Data and config layout

Important directories:

configs/datasets/ contains dataset configs
configs/models/ contains model specs
configs/packs/ contains named model packs
prompts/ contains prompt templates and exemplar pools
fixtures/raw/ contains local fixture data for smoke tests
artifacts/runs/ contains run outputs

Dataset configs point at a raw_dir. For local use, you can either:

place the expected raw or normalized JSONL files in the configured path, or
edit the dataset config to point at your own local dataset location
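
The second option can be scripted. This is a minimal sketch that assumes the dataset config is a JSON object with a top-level raw_dir key; the actual config layout may differ, so check the file before rewriting it. set_raw_dir is a hypothetical helper.

```python
import json
from pathlib import Path


def set_raw_dir(config_path, new_raw_dir):
    """Point a dataset config at a new raw_dir and return the previous value.

    Assumes a JSON object with a top-level "raw_dir" key (an assumption,
    not a documented schema).
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    previous = config.get("raw_dir")
    config["raw_dir"] = str(new_raw_dir)
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return previous
```

For example, set_raw_dir("configs/datasets/totto.json", "/data/totto") would rewrite the ToTTo config in place; /data/totto is a placeholder path.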

The batch scripts can also download processed JSONL splits from a Hugging Face dataset repo specified through DATASET_REPO.

Model selection

The checked-in active packs are:

default
smoke
specialist
mock

You can inspect them with:

opent2t list-packs
opent2t preflight-models --pack default

Running experiments with Slurm

The repository includes two Slurm entrypoints:

scripts/model_benchmark_zero_shot.sbatch
scripts/qwen3_dataset_eval.sbatch

Important:

The checked-in scripts contain site-specific defaults in their #SBATCH headers.
Other users should override account, partition, log path, and environment paths on the sbatch command line, or adapt a local copy of the script.
Do not hardcode tokens in scripts. Export them from your shell and pass them through --export.

Generic single-job submission

export HF_TOKEN=<your_hf_token>

sbatch \
  -A <account> \
  -p <partition> \
  --gres=gpu:1 \
  -o /path/to/logs/%x-%j.out \
  --export=ALL,HF_TOKEN_VALUE=$HF_TOKEN,REPO_ROOT=/path/to/opent2t,ENV_ROOT=/path/to/env,DATASET_REPO=<hf_user>/<dataset_repo>,RUN_ROOT=/path/to/runs,RAW_ROOT=/path/to/raw \
  scripts/model_benchmark_zero_shot.sbatch \
  qwen25_7b \
  totto \
  test

Useful exported overrides:

BATCH_SIZE
SAMPLE_SIZE
RUN_ROOT
RAW_ROOT
JOB_CACHE_ROOT

Example with a smaller sample:

sbatch \
  -A <account> \
  -p <partition> \
  --gres=gpu:1 \
  -o /path/to/logs/%x-%j.out \
  --export=ALL,HF_TOKEN_VALUE=$HF_TOKEN,REPO_ROOT=/path/to/opent2t,ENV_ROOT=/path/to/env,DATASET_REPO=<hf_user>/<dataset_repo>,BATCH_SIZE=8,SAMPLE_SIZE=100 \
  scripts/model_benchmark_zero_shot.sbatch \
  phi4_mini \
  logicnlg \
  test

Generic Qwen3 sweep submission

sbatch \
  -A <account> \
  -p <partition> \
  --gres=gpu:1 \
  -o /path/to/logs/%x-%j.out \
  --export=ALL,HF_TOKEN_VALUE=$HF_TOKEN,REPO_ROOT=/path/to/opent2t,ENV_ROOT=/path/to/env,DATASET_REPO=<hf_user>/<dataset_repo>,RUN_ROOT=/path/to/qwen3_runs,RAW_ROOT=/path/to/raw \
  scripts/qwen3_dataset_eval.sbatch \
  totto \
  test

Submit one job per benchmark dataset

for dataset in logicnlg totto hitabng hitabqa rotowire numericnlg scigen fetaqa qtsumm; do
  sbatch \
    -A <account> \
    -p <partition> \
    --gres=gpu:1 \
    -o /path/to/logs/%x-%j.out \
    --export=ALL,HF_TOKEN_VALUE=$HF_TOKEN,REPO_ROOT=/path/to/opent2t,ENV_ROOT=/path/to/env,DATASET_REPO=<hf_user>/<dataset_repo> \
    scripts/qwen3_dataset_eval.sbatch \
    "$dataset" \
    test
done
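
The same loop can be planned from Python, which makes it easy to review the commands before submitting anything. The sketch below only builds and prints sbatch command lines matching the loop above; sbatch_command is a hypothetical helper and the angle-bracket values are the same placeholders used throughout this section. The token is deliberately left out of the printed plan; add HF_TOKEN_VALUE from your environment at submission time.

```python
import shlex

DATASETS = ["logicnlg", "totto", "hitabng", "hitabqa", "rotowire",
            "numericnlg", "scigen", "fetaqa", "qtsumm"]


def sbatch_command(dataset, split, account, partition, log_pattern, exports):
    """Build one sbatch invocation for scripts/qwen3_dataset_eval.sbatch."""
    export_arg = "ALL," + ",".join(f"{k}={v}" for k, v in exports.items())
    return [
        "sbatch", "-A", account, "-p", partition, "--gres=gpu:1",
        "-o", log_pattern, f"--export={export_arg}",
        "scripts/qwen3_dataset_eval.sbatch", dataset, split,
    ]


if __name__ == "__main__":
    exports = {"REPO_ROOT": "/path/to/opent2t", "ENV_ROOT": "/path/to/env"}
    for dataset in DATASETS:
        cmd = sbatch_command(dataset, "test", "<account>", "<partition>",
                             "/path/to/logs/%x-%j.out", exports)
        print(shlex.join(cmd))
```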

Matrix submission helper

If you want the repo to submit many jobs for you, use:

HF_TOKEN=<your_hf_token> \
PYTHONPATH=src \
python3 scripts/submit_model_benchmark_matrix.py \
  --ssh-host <cluster_login_host> \
  --repo-root /path/to/opent2t \
  --env-root /path/to/env \
  --run-root /path/to/runs \
  --include-qwen3

Useful flags:

--plan-only
--datasets logicnlg totto
--include-models qwen25_7b phi4_mini
--exclude-models internlm2_math_plus_20b

Monitoring jobs and results

Typical Slurm commands:

squeue -u <username>
sacct -j <job_id> --format=JobIDRaw,JobName,State,ExitCode -n -P
tail -f /path/to/logs/<job-name>-<job-id>.out
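
With -n -P, sacct prints one pipe-delimited line per job step and no header, which is easy to parse. The sketch below turns that output into dictionaries and picks out steps that did not finish cleanly; parse_sacct and unfinished_steps are hypothetical helpers, and the job IDs and names in the sample are made up.

```python
FIELDS = ("job_id", "job_name", "state", "exit_code")


def parse_sacct(output):
    """Parse `sacct --format=JobIDRaw,JobName,State,ExitCode -n -P` output."""
    rows = []
    for line in output.splitlines():
        parts = line.strip().split("|")
        if len(parts) != len(FIELDS):
            continue  # skip blank or unexpected lines
        rows.append(dict(zip(FIELDS, parts)))
    return rows


def unfinished_steps(rows, ok_states=("COMPLETED", "RUNNING", "PENDING")):
    """Return rows whose state is not one of ok_states."""
    return [row for row in rows if row["state"] not in ok_states]


if __name__ == "__main__":
    sample = (
        "12345|opent2t-totto|COMPLETED|0:0\n"
        "12345.batch|batch|COMPLETED|0:0\n"
        "12346|opent2t-logicnlg|FAILED|1:0\n"
    )
    for row in unfinished_steps(parse_sacct(sample)):
        print(row["job_id"], row["state"], row["exit_code"])
```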

Where inputs and run outputs live:

Prompt text lives under prompts/.
Dataset fixtures live under fixtures/raw/.
Run outputs are written under artifacts/runs/.
Each completed run writes manifest.json, prompts.jsonl, generations.jsonl, parsed_predictions.jsonl, metrics_per_example.jsonl, metrics_aggregate.json, summary.csv, and summary.md.
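
A quick way to confirm that a run finished is to check for these files and load the aggregate metrics. This is a sketch under stated assumptions: the file names come from the list above, but the contents of metrics_aggregate.json are not documented here, so the code only assumes it is valid JSON. missing_artifacts and load_aggregate_metrics are hypothetical helpers.

```python
import json
from pathlib import Path

EXPECTED_FILES = [
    "manifest.json",
    "prompts.jsonl",
    "generations.jsonl",
    "parsed_predictions.jsonl",
    "metrics_per_example.jsonl",
    "metrics_aggregate.json",
    "summary.csv",
    "summary.md",
]


def missing_artifacts(run_dir):
    """Return the expected artifact names that are absent from run_dir."""
    root = Path(run_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


def load_aggregate_metrics(run_dir):
    """Load metrics_aggregate.json as plain JSON, with no schema assumed."""
    path = Path(run_dir) / "metrics_aggregate.json"
    return json.loads(path.read_text(encoding="utf-8"))
```

For example, missing_artifacts("artifacts/runs/totto_dev_demo") returns an empty list once all eight files have been written.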

Citation

If you use this repository, please cite:

@inproceedings{zhang-etal-2024-opent2t,
    title = "{O}pen{T}2{T}: An Open-Source Toolkit for Table-to-Text Generation",
    author = "Zhang, Haowei  and
      Si, Shengyun  and
      Zhao, Yilun  and
      Xie, Lujing  and
      Xu, Zhijian  and
      Chen, Lyuhao  and
      Nan, Linyong  and
      Wang, Pengcheng  and
      Tang, Xiangru  and
      Cohan, Arman",
    editor = "Hernandez Farias, Delia Irazu  and
      Hope, Tom  and
      Li, Manling",
    booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.emnlp-demo.27/",
    doi = "10.18653/v1/2024.emnlp-demo.27",
    pages = "259--269",
    abstract = "Table data is pervasive in various industries, and its comprehension and manipulation demand significant time and effort for users seeking to extract relevant information. Consequently, an increasing number of studies have been directed towards table-to-text generation tasks. However, most existing methods are benchmarked solely on a limited number of datasets with varying configurations, leading to a lack of unified, standardized, fair, and comprehensive comparison between methods. This paper presents OpenT2T, the first open-source toolkit for table-to-text generation, designed to reproduce existing large language models (LLMs) for performance comparison and expedite the development of new models. We have implemented and compared a wide range of LLMs under zero- and few-shot settings on 9 table-to-text generation datasets, covering data insight generation, table summarization, and free-form table question answering. Additionally, we maintain a public leaderboard to provide insights for future work into how to choose appropriate table-to-text generation systems for real-world scenarios."
}
