An independent, reproducible leaderboard for medical image segmentation models, where every score has a confidence interval and a ranking-stability test, and you can reproduce the whole thing with one command.
Most medical image segmentation leaderboards report a single Dice number per model with no interval, so you cannot tell a real lead from sampling noise. The people who study segmentation metrics have shown this is not a nitpick:
- Removing a single test case changes the rank of most teams in a majority of segmentation challenges (Maier-Hein et al., Nature Communications 2018).
- In over 60% of MICCAI 2023 segmentation papers, the second-ranked method sits inside the first's confidence interval, i.e. the reported "winner" is not statistically distinguishable (Confidence Intervals Uncovered, MICCAI 2024).
segauge-benchmark turns that critique into a running tool. It:
- runs real models on public CT/MRI data,
- scores them with segauge (Dice, HD95, Normalized Surface Dice, ASSD), putting a bootstrap confidence interval on every number,
- reports a ranking-stability test (how often each model is actually rank 1 under case resampling) and a pairwise significance test (is #1 separable from #2?),
- is contamination-aware: a model is only ranked on a dataset it was not trained on, and
- is reproducible with one command on data anyone can download.
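
The two statistics named above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, assuming a percentile bootstrap over cases; it is not segauge's implementation, and the function names `bootstrap_ci` and `rank1_frequency` are hypothetical.

```python
import random
from statistics import mean


def bootstrap_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-case scores (assumed method)."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample cases with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(scores, k=n)) for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return mean(scores), lo, hi


def rank1_frequency(per_case, n_boot=2000, seed=0):
    """Fraction of case resamples in which each model has the best mean score.

    per_case maps a model name to its per-case scores, aligned by case index.
    """
    rng = random.Random(seed)
    models = list(per_case)
    n = len(per_case[models[0]])
    wins = {m: 0 for m in models}
    for _ in range(n_boot):
        # Draw one set of case indices and apply it to every model,
        # so all models are compared on the same resampled cases.
        idx = [rng.randrange(n) for _ in range(n)]
        best = max(models, key=lambda m: sum(per_case[m][i] for i in idx))
        wins[best] += 1
    return {m: wins[m] / n_boot for m in models}


if __name__ == "__main__":
    # Illustrative numbers only, not leaderboard results.
    demo = {
        "model_a": [0.90, 0.85, 0.92, 0.70, 0.88, 0.95, 0.60, 0.91],
        "model_b": [0.89, 0.87, 0.90, 0.75, 0.86, 0.93, 0.65, 0.90],
    }
    for name, scores in demo.items():
        print(name, bootstrap_ci(scores))
    print(rank1_frequency(demo))
```

A model that is rank 1 in only about half of the resamples is not a stable winner, whatever its point estimate says.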
There is already a good independent benchmark in this space, Touchstone (NeurIPS 2024), which evaluates models on large multi-center CT data. We do not claim better or secret data. The wedge here is different and complementary:
- Statistics first. Every row carries a CI and a ranking-stability result. A submission-server benchmark structurally cannot do this, because it never returns per-case predictions.
- Reproducible on public data. You can re-run every number yourself with `segbench run`. The trade-off, stated plainly, is that public-ground-truth data is in-distribution for some models; we handle that with an explicit contamination policy rather than by pretending it away.
- Failure-mode slicing. Per-organ today; per-scanner and demographic where the dataset provides it.
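
The contamination rule (a model is ranked only on a dataset it was not trained on) amounts to a filter applied before ranking. The sketch below assumes each model declares its training datasets; the `ModelCard` type, the `rankable` helper, and the placeholder model and dataset names are all hypothetical and say nothing about what any real model was trained on.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ModelCard:
    """Hypothetical record of which datasets a model was trained on."""
    name: str
    trained_on: frozenset = field(default_factory=frozenset)


def rankable(models, dataset):
    """Split models into those eligible for ranking on `dataset` and those excluded."""
    eligible = [m.name for m in models if dataset not in m.trained_on]
    excluded = [m.name for m in models if dataset in m.trained_on]
    return eligible, excluded


if __name__ == "__main__":
    # Placeholder names, not real models or datasets.
    cards = [
        ModelCard("model_a", frozenset({"dataset_x"})),
        ModelCard("model_b", frozenset({"dataset_y"})),
    ]
    print(rankable(cards, "dataset_x"))
```

Excluded models can still be scored and shown; they are simply kept out of the ranking for that dataset.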
```shell
git clone https://github.com/RubenHaisma/segauge-benchmark
cd segauge-benchmark
uv sync

# Reproduce the seed leaderboard (downloads a few KiTS23 cases, runs on CPU):
uv run segbench run --config configs/kits23_seed.yaml --render docs

# Render the site from an existing results file:
uv run segbench render --results results/kits23_seed.json --out docs

# Verify the published ranking follows exactly from the published per-case scores
# (no inference, no GPU; this is the gate CI runs on every push):
uv run segbench reanalyze --results results/kits23_modal.json
```

The live leaderboard: https://rubenhaisma.github.io/segauge-benchmark/
Early preview, but already real and multi-model. The harness has adapters for four genuinely different whole-body CT models, run on GPU:
- TotalSegmentator (the de-facto nnU-Net-based reference)
- MOOSE (`clin_ct_organs`, from the nuclear-medicine / PET-CT world, an independent lineage)
- CT-FM (a SegResNet foundation model)
- MONAI wholeBody (the framework's own bundle)
The current rendered KiTS23 run scores the first three on whole-kidney segmentation over 20 cases. MONAI is validated but was dropped from this N=20 render because of an out-of-memory error that is now fixed; it returns on the next run. At N=20 the three rendered models are statistically indistinguishable on kidney: every pairwise difference has a confidence interval that includes zero, after a Bonferroni correction for the three comparisons. That is precisely the point: a bare Dice table would have declared a "winner". Next steps: the clean four-model re-run, the AMOS22 multi-organ run (15 organs, per-organ failure slicing), and adding STU-Net.
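
To make "a confidence interval that includes zero, after a Bonferroni correction" concrete, here is a minimal sketch of a paired bootstrap on per-case score differences. It assumes a percentile bootstrap and a family-wise alpha of 0.05 split evenly across the comparisons; the exact procedure segauge uses is not described here, and the function names are hypothetical.

```python
import random
from itertools import combinations


def paired_diff_ci(a, b, alpha, n_boot=2000, seed=0):
    """Percentile bootstrap CI for the mean per-case difference a - b."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]  # paired: same case, two models
    n = len(diffs)
    boots = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_boot))
    lo = boots[int((alpha / 2) * n_boot)]
    hi = boots[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


def pairwise_separability(per_case, family_alpha=0.05):
    """Test every model pair, with alpha divided by the number of pairs."""
    pairs = list(combinations(per_case, 2))
    alpha = family_alpha / len(pairs)  # Bonferroni correction
    results = {}
    for m1, m2 in pairs:
        lo, hi = paired_diff_ci(per_case[m1], per_case[m2], alpha)
        # Two models are separable only if the interval excludes zero.
        results[(m1, m2)] = {"ci": (lo, hi), "separable": not (lo <= 0.0 <= hi)}
    return results


if __name__ == "__main__":
    # Illustrative numbers only, not leaderboard results.
    demo = {
        "model_a": [0.90, 0.85, 0.92, 0.70, 0.88],
        "model_b": [0.89, 0.87, 0.90, 0.75, 0.86],
        "model_c": [0.91, 0.84, 0.93, 0.72, 0.87],
    }
    for pair, res in pairwise_separability(demo).items():
        print(pair, res)
```

With three models there are three pairs, so each interval is computed at alpha = 0.05 / 3, which widens it relative to an uncorrected 95% interval.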
Inference runs on Modal serverless GPUs, one isolated image per model:
```shell
# Multi-model run on KiTS23 (downloads cases into a Modal Volume, runs each model on a GPU):
uv run modal run modal_app.py::main --config configs/kits23_modal.yaml --n-cases 20

# One-time AMOS22 download into the volume, then the multi-organ run:
uv run modal run modal_app.py::prep_amos
uv run modal run modal_app.py::main --config configs/amos_modal.yaml --n-cases 20
```

The harness also runs locally on CPU for a single model (`uv run segbench run --config configs/kits23_seed.yaml`).
- A dataset adapter (`src/segbench/datasets/`) yields cases (image, ground-truth label map, metadata) and declares its label schema and license.
- A model adapter (`src/segbench/models/`) runs inference in its own environment and returns one mask per organ it supports, remapped into the dataset's label schema by organ name. The harness never imports torch.
- The runner (`src/segbench/run.py`) scores every supported organ with segauge, spot-checks each model so a misaligned adapter cannot silently publish zeros, and writes one results JSON that fully describes the run.
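
The remap-by-organ-name step and the spot-check can be pictured roughly as follows. This is a dependency-free sketch, not the code in `src/segbench/`: masks are flat lists of 0/1 voxels, the Dice function is a plain reference formula rather than segauge's, and the helper names and the `min_mean` threshold are assumptions made for illustration.

```python
from typing import Dict, List

# A mask is a flat list of 0/1 voxels here, to keep the sketch dependency-free.
Mask = List[int]


def remap_by_organ_name(model_masks: Dict[str, Mask],
                        dataset_schema: Dict[str, int]) -> Dict[int, Mask]:
    """Keep only organs the dataset defines, keyed by the dataset's label id."""
    return {
        dataset_schema[organ]: mask
        for organ, mask in model_masks.items()
        if organ in dataset_schema
    }


def dice(pred: Mask, truth: Mask) -> float:
    """Dice = 2 * |pred AND truth| / (|pred| + |truth|); 1.0 if both are empty."""
    inter = sum(p & t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    return 1.0 if total == 0 else 2.0 * inter / total


def spot_check(scores: List[float], min_mean: float = 0.05) -> None:
    """Raise if a model scores near zero everywhere (hypothetical threshold)."""
    if not scores or sum(scores) / len(scores) < min_mean:
        raise ValueError("mean Dice near zero: adapter output is probably misaligned")


if __name__ == "__main__":
    schema = {"kidney": 1}  # hypothetical dataset label schema
    masks = {"kidney": [1, 1, 0, 0], "liver": [0, 0, 1, 1]}
    remapped = remap_by_organ_name(masks, schema)  # liver is dropped
    score = dice(remapped[1], [1, 0, 0, 0])
    spot_check([score])
    print(remapped, score)
```

Matching on organ name rather than on label id is what lets models with different label numbering be scored against the same dataset schema.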
See the methodology page for the contamination policy, the statistics, and the per-model label mapping.
The leaderboard publishes derived scores, never re-hosted images. It stays non-commercial where a dataset's license (e.g. KiTS23, CC BY-NC-SA 4.0) requires it, and cites every dataset. The commercially-safe (CC-BY family) datasets used or planned are AMOS22, the TotalSegmentator dataset, MSD, and BTCV.
Apache-2.0. Built on segauge.