Releases · specula-org/SysMoBench

SysMoBench evaluates AI on formally modeling complex real-world systems. It targets TLA+ and runs four kinds of evaluation end-to-end: SANY syntax, TLC runtime, transition validation against captured system traces, and invariant verification against expert templates.

The corresponding paper appears at ICLR 2026: "SysMoBench: Evaluating AI on Formally Modeling Complex Real-World Systems". Up-to-date scores are at sysmobench.com.

Install

Container:

docker pull ghcr.io/specula-org/sysmobench:v1.0.0
docker run --rm -it -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY ghcr.io/specula-org/sysmobench:v1.0.0

From source:

git clone --branch v1.0.0 https://github.com/specula-org/SysMoBench.git
cd SysMoBench
python3 -m venv .venv && source .venv/bin/activate
pip install -e .
sysmobench-setup

Tasks

11 systems: spin, mutex, rwmutex, ringbuffer (Asterinas concurrent primitives); etcd, redisraft, raftkvs, curp (Raft / CURP consensus); zookeeper (distributed coordination); dqueue, locksvc (PGo-compiled distributed services).

Metrics

Stage	Metric
Syntax	`compilation_check`, `action_decomposition`
Runtime	`runtime_check`, `coverage`, `runtime_coverage`
Transition validation	`transition_validation`
Invariant verification	`invariant_verification`

Canonical phase weights: P1=0.15, P2=0.15, P3=0.35, P4=0.35.

Citation

@inproceedings{cheng2026sysmobench,
  title     = {SysMoBench: Evaluating AI on Formally Modeling Complex Real-World Systems},
  author    = {Cheng, Qian and Tang, Ruize and Ma, Emilie and Hackett, Finn and
               He, Peiyang and Su, Yiming and Beschastnikh, Ivan and Huang, Yu and
               Ma, Xiaoxing and Xu, Tianyin},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2026},
  url       = {https://arxiv.org/abs/2509.23130}
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Choose a tag to compare

Sorry, something went wrong.

Sorry, something went wrong.

Uh oh!

No results found

Install

Tasks

Metrics

Citation

Uh oh!

Releases: specula-org/SysMoBench

v1.0.0

Install

Tasks

Metrics

Citation

Uh oh!