Releases: specula-org/SysMoBench
Releases · specula-org/SysMoBench
v1.0.0
SysMoBench evaluates AI on formally modeling complex real-world systems. It targets TLA+ and runs four kinds of evaluation end-to-end: SANY syntax, TLC runtime, transition validation against captured system traces, and invariant verification against expert templates.
The corresponding paper appears at ICLR 2026: "SysMoBench: Evaluating AI on Formally Modeling Complex Real-World Systems". Up-to-date scores are at sysmobench.com.
Install
Container:
docker pull ghcr.io/specula-org/sysmobench:v1.0.0
docker run --rm -it -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY ghcr.io/specula-org/sysmobench:v1.0.0
From source:
git clone --branch v1.0.0 https://github.com/specula-org/SysMoBench.git
cd SysMoBench
python3 -m venv .venv && source .venv/bin/activate
pip install -e .
sysmobench-setup
Tasks
11 systems: spin, mutex, rwmutex, ringbuffer (Asterinas concurrent primitives); etcd, redisraft, raftkvs, curp (Raft / CURP consensus); zookeeper (distributed coordination); dqueue, locksvc (PGo-compiled distributed services).
Metrics
| Stage | Metric |
|---|---|
| Syntax | compilation_check, action_decomposition |
| Runtime | runtime_check, coverage, runtime_coverage |
| Transition validation | transition_validation |
| Invariant verification | invariant_verification |
Canonical phase weights: P1=0.15, P2=0.15, P3=0.35, P4=0.35.
Citation
@inproceedings{cheng2026sysmobench,
title = {SysMoBench: Evaluating AI on Formally Modeling Complex Real-World Systems},
author = {Cheng, Qian and Tang, Ruize and Ma, Emilie and Hackett, Finn and
He, Peiyang and Su, Yiming and Beschastnikh, Ivan and Huang, Yu and
Ma, Xiaoxing and Xu, Tianyin},
booktitle = {International Conference on Learning Representations (ICLR)},
year = {2026},
url = {https://arxiv.org/abs/2509.23130}
}