How much does your image classifier degrade under corruption — and which corruption breaks it first?
A small, referenceable corruption-robustness benchmark in the spirit of ImageNet-C / CIFAR-10-C: train on clean data, grade the same model under a battery of named corruptions (gaussian_noise, blur, rotate, occlusion, pixelate, …) at severities 1-5, and report one headline number — the robustness gap (clean accuracy minus mean corrupted accuracy). Bring your own classifier.
CLI-first, --json on every command, load-bearing exit codes — so a coding agent (Claude Code, Codex, Cursor) can grade a model and parse the full degradation matrix with no UI.
computer-vision · robustness · benchmark · distribution-shift · imagenet-c · cifar-10-c · model-evaluation · cli · agents
One command produces the standard robustness report. On scikit-learn digits with an SVC (clean accuracy 0.988):
clean 0.988 → mean corrupted 0.771 robustness gap +0.217 (relative 0.78)
most robust to: quantize 0.98 · shot_noise 0.98 · gaussian_noise 0.95
most fragile to: pixelate 0.42 · translate 0.59 · rotate 0.61
The full corruption × severity matrix is in the --json output, so you can see the exact severity at which each corruption collapses the model (e.g. SVC on digits survives blur to severity 3, then falls off a cliff).
Run compare on two models and you get the point of the whole exercise:
| model | clean acc | robustness gap |
|---|---|---|
| SVC | 0.988 | +0.217 |
| KNN | 0.980 | +0.183 |
The higher-clean-accuracy model (SVC) is the less robust one. A leaderboard that ranks on clean accuracy would pick the more fragile model. That's exactly why robustness needs its own number — and why this benchmark ranks on the gap, not on clean accuracy.
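To make the ranking argument concrete, here is a minimal sketch that sorts the two rows from the table above both ways. The `Result` record and its field names are illustrative assumptions, not the tool's actual data model; only the numbers come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One row of the comparison table (field names are hypothetical)."""
    model: str
    clean_acc: float
    robustness_gap: float  # clean accuracy minus mean corrupted accuracy


# Numbers copied from the comparison table above.
results = [
    Result("SVC", clean_acc=0.988, robustness_gap=0.217),
    Result("KNN", clean_acc=0.980, robustness_gap=0.183),
]

# A clean-accuracy leaderboard: higher is better.
by_clean = sorted(results, key=lambda r: r.clean_acc, reverse=True)

# A robustness leaderboard: a smaller gap means less degradation.
by_gap = sorted(results, key=lambda r: r.robustness_gap)

print("winner on clean accuracy:", by_clean[0].model)
print("winner on robustness gap:", by_gap[0].model)
```

The two leaderboards disagree on the winner, which is the case the gap metric is meant to surface.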
Needs uv. The verified path is CPU-only, no torch, ~1 second.
uv sync --extra dev
uv run robustness-eval doctor --json
uv run robustness-eval corruptions --json # the benchmark's axes
uv run robustness-eval run configs/digits-robustness.yaml --json # grade a model
uv run robustness-eval compare configs/digits-robustness.yaml configs/digits-knn.yaml --json

Or run it as a tool with no clone:

uvx --from git+https://github.com/RubenHaisma/robustness-eval robustness-eval corruptions --json

- Train the classifier on the clean training split.
- For each (corruption, severity), apply the corruption to a copy of the clean test split and measure accuracy. Corruptions are test-time only and deterministic given the seed, so this measures genuine distribution-shift robustness, not training augmentation, and the whole report reproduces.
- Aggregate into: clean accuracy, the corruption × severity matrix, mean corrupted accuracy, the robustness gap (clean − mean_corrupted), and relative robustness (mean_corrupted / clean, so models of different clean accuracy compare fairly).
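The aggregation step can be sketched in a few lines. This is an illustration under stated assumptions, not the code in lib/evaluate.py: the function names are hypothetical, and it assumes mean corrupted accuracy is an unweighted mean over every (corruption, severity) cell.

```python
from statistics import mean


def mean_corrupted_accuracy(matrix: dict[str, list[float]]) -> float:
    """Average accuracy over every (corruption, severity) cell.

    Assumption: an unweighted mean over all cells; the tool may weight
    corruptions or severities differently.
    """
    return mean(acc for row in matrix.values() for acc in row)


def headline(clean_acc: float, mean_corrupted: float) -> dict[str, float]:
    """The two summary numbers defined in the aggregation step."""
    return {
        "robustness_gap": clean_acc - mean_corrupted,
        "relative_robustness": mean_corrupted / clean_acc,
    }


# The digits/SVC figures quoted earlier in this README.
report = headline(clean_acc=0.988, mean_corrupted=0.771)
print(f"gap {report['robustness_gap']:+.3f}, relative {report['relative_robustness']:.2f}")
# prints: gap +0.217, relative 0.78
```

Dividing by clean accuracy is what lets a 0.98-clean model and a 0.90-clean model be compared on the same scale.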
The corruption registry is in lib/corrupt.py — channel-agnostic (grayscale or colour), severity-parameterised, seeded. Add a corruption by writing (img, severity, rng) -> img and registering it; the benchmark and the corruptions spec pick it up automatically.
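As an illustration of that signature, here is a minimal sketch of a seeded, severity-parameterised corruption. The `CORRUPTIONS` dict, the `register` decorator, the `salt_and_pepper` corruption, and the severity-to-fraction mapping are all hypothetical; the real registration API in lib/corrupt.py may look different. The sketch also assumes images are float arrays in [0, 1] and that rng is a `numpy.random.Generator`.

```python
from typing import Callable

import numpy as np

Corruption = Callable[[np.ndarray, int, np.random.Generator], np.ndarray]

# Hypothetical stand-in for the real registry in lib/corrupt.py.
CORRUPTIONS: dict[str, Corruption] = {}


def register(name: str) -> Callable[[Corruption], Corruption]:
    """Hypothetical decorator that records a corruption under a name."""
    def wrap(fn: Corruption) -> Corruption:
        CORRUPTIONS[name] = fn
        return fn
    return wrap


@register("salt_and_pepper")
def salt_and_pepper(img: np.ndarray, severity: int, rng: np.random.Generator) -> np.ndarray:
    """Set a severity-dependent fraction of pixels to 0 or 1.

    Assumes img is a float array in [0, 1]; works for (H, W) or (H, W, C).
    """
    if not 1 <= severity <= 5:
        raise ValueError("severity must be in 1..5")
    fraction = 0.04 * severity                    # illustrative mapping: 4% to 20%
    out = img.copy()                              # never mutate the clean image
    hit = rng.random(img.shape[:2]) < fraction    # which pixel locations to corrupt
    salt = rng.random(img.shape[:2]) < 0.5        # roughly half white, half black
    out[hit & salt] = 1.0
    out[hit & ~salt] = 0.0
    return out


rng = np.random.default_rng(0)
clean = np.full((8, 8), 0.5)
corrupted = CORRUPTIONS["salt_and_pepper"](clean, 3, rng)
print(corrupted.shape, float(corrupted.min()), float(corrupted.max()))
```

Taking the generator as an argument, instead of seeding inside the function, is what keeps a corruption deterministic given the benchmark seed while still drawing fresh randomness per image.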
robustness-eval run <config> reads the backend: key from the config and routes to the matching backend; the --json output has the same shape either way.
| backend | classifier | compute | status |
|---|---|---|---|
| sklearn | a small scikit-learn model on digits | CPU, in-process | ✅ verified in CI, ~1s |
| cnn | a small torch CNN on CIFAR-10 | compute: modal (rented GPU) or compute: local (your GPU) | 🔌 wired; needs a GPU (not run in CI) |
uv run robustness-eval run configs/cifar-robustness.yaml --dry-run --json # inspect the GPU plan, no spend

| Claim | How it's checked | Status |
|---|---|---|
| Corruptions degrade accuracy (positive gap) | test_corruptions_degrade (load-bearing invariant) | ✅ CI |
| Higher severity hurts more | test_higher_severity_hurts_more | ✅ CI |
| The benchmark is deterministic (seed → identical report) | make repro (runs twice, asserts equal) | ✅ CI |
| Every corruption stays in [0,1] and preserves shape | test_all_corruptions_preserve_shape_and_range | ✅ CI |
| The README quickstart still runs | make readme runs the <!-- ci-test --> block | ✅ CI |
| Live metrics posted to the run summary | CI runs the benchmark + ci_report.py | ✅ CI |
| CNN/CIFAR robustness on a real GPU | scripts/modal_cnn.py | 🔌 wired, needs a GPU |
Every command is non-interactive, takes --json, and uses load-bearing exit codes. The contract: with --json, stdout is exactly one JSON object (success or {"ok": false, "error": "..."}); exit 0 = success, non-zero = failure with one stderr line. Parse stdout, branch on the exit code. Agent instructions live in AGENTS.md (CLAUDE.md is a symlink to it).
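On the consuming side, the contract can be handled with a small helper like the sketch below. `parse_result` and `CommandFailed` are hypothetical names, and the success payload is trimmed to the one field this README documents (`metrics.robustness_gap`); real payloads carry more fields.

```python
import json
from typing import Any


class CommandFailed(RuntimeError):
    """Raised when the CLI reports failure via exit code or payload."""


def parse_result(returncode: int, stdout: str, stderr: str = "") -> dict[str, Any]:
    """Interpret one finished --json invocation under the contract above."""
    try:
        payload = json.loads(stdout)
    except json.JSONDecodeError:
        payload = None
    if returncode != 0:
        # On failure, prefer the JSON error field, then the single stderr line.
        message = payload.get("error") if isinstance(payload, dict) else None
        raise CommandFailed(message or stderr.strip() or f"exit code {returncode}")
    if not isinstance(payload, dict):
        raise CommandFailed("stdout was not a single JSON object")
    return payload


# Canned strings stand in for a real run; with subprocess.run(...) you would
# pass proc.returncode, proc.stdout and proc.stderr instead.
ok = parse_result(0, '{"metrics": {"robustness_gap": 0.217}}')
print(ok["metrics"]["robustness_gap"])
```

Branching on the exit code first means a caller never mistakes an error object for a report, even if it forgets to check the `ok` field.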
robustness-eval run my-model.yaml --json | jq '.metrics.robustness_gap' # one number to gate a regression

src/robustness_eval/
  cli.py # Typer app, one command per capability
  output.py # emit()/fail(): output + exit-code contract
  commands/ # doctor, run, corruptions, compare, version
  lib/
    corrupt.py # the corruption registry (the benchmark axes)
    evaluate.py # train clean, grade under corruption → the report
    data.py # digits train/clean-test split
    backends.py # sklearn (verified) | cnn (GPU) dispatch
    cnn_runner.py # the torch/CIFAR engine (gpu extra)
configs/ # digits-robustness.yaml (verified) · cifar-robustness.yaml (GPU)
scripts/ # modal_cnn.py + stdlib CI helpers
notebooks/ # marimo (.py): heatmap + corruption strength vs. fragility
A portfolio piece in computer-vision evaluation that argues a real point: a single accuracy number hides fragility, and the model that wins on clean data can lose under shift. Built in the same house style as rl-augment (which searches augmentations to help training — the natural complement to measuring robustness to corruption) and rl-studio.
Apache-2.0.