Skip to content

Releases: specula-org/SysMoBench

v1.0.0

03 May 15:35

Choose a tag to compare

SysMoBench evaluates AI on formally modeling complex real-world systems. It targets TLA+ and runs four kinds of evaluation end-to-end: SANY syntax, TLC runtime, transition validation against captured system traces, and invariant verification against expert templates.

The corresponding paper appears at ICLR 2026: "SysMoBench: Evaluating AI on Formally Modeling Complex Real-World Systems". Up-to-date scores are at sysmobench.com.

Install

Container:

docker pull ghcr.io/specula-org/sysmobench:v1.0.0
docker run --rm -it -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY ghcr.io/specula-org/sysmobench:v1.0.0

From source:

git clone --branch v1.0.0 https://github.com/specula-org/SysMoBench.git
cd SysMoBench
python3 -m venv .venv && source .venv/bin/activate
pip install -e .
sysmobench-setup

Tasks

11 systems: spin, mutex, rwmutex, ringbuffer (Asterinas concurrent primitives); etcd, redisraft, raftkvs, curp (Raft / CURP consensus); zookeeper (distributed coordination); dqueue, locksvc (PGo-compiled distributed services).

Metrics

Stage Metric
Syntax compilation_check, action_decomposition
Runtime runtime_check, coverage, runtime_coverage
Transition validation transition_validation
Invariant verification invariant_verification

Canonical phase weights: P1=0.15, P2=0.15, P3=0.35, P4=0.35.

Citation

@inproceedings{cheng2026sysmobench,
  title     = {SysMoBench: Evaluating AI on Formally Modeling Complex Real-World Systems},
  author    = {Cheng, Qian and Tang, Ruize and Ma, Emilie and Hackett, Finn and
               He, Peiyang and Su, Yiming and Beschastnikh, Ivan and Huang, Yu and
               Ma, Xiaoxing and Xu, Tianyin},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2026},
  url       = {https://arxiv.org/abs/2509.23130}
}