robustness-eval

How much does your image classifier degrade under corruption — and which corruption breaks it first?

A small, referenceable corruption-robustness benchmark in the spirit of ImageNet-C / CIFAR-10-C. Train on clean data, then grade the same model under a battery of named corruptions (gaussian_noise, blur, rotate, occlusion, pixelate, …) at severities 1-5. The report's headline number is the robustness gap: clean accuracy minus mean corrupted accuracy. Bring your own classifier.

CLI-first, --json on every command, load-bearing exit codes — so a coding agent (Claude Code, Codex, Cursor) can grade a model and parse the full degradation matrix with no UI.

computer-vision · robustness · benchmark · distribution-shift · imagenet-c · cifar-10-c · model-evaluation · cli · agents

What you get

One command produces the standard robustness report. On scikit-learn digits with an SVC (clean accuracy 0.988):

clean 0.988  →  mean corrupted 0.771   robustness gap +0.217   (relative 0.78)

most robust to:   quantize 0.98 · shot_noise 0.98 · gaussian_noise 0.95
most fragile to:  pixelate 0.42 · translate 0.59 · rotate 0.61

The full corruption × severity matrix is in the --json output, so you can see the exact severity at which each corruption collapses the model (e.g. SVC on digits survives blur to severity 3, then falls off a cliff).
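
To make "the exact severity at which each corruption collapses the model" concrete, here is a minimal sketch that scans a corruption × severity matrix for the first severity whose accuracy falls below a chosen fraction of clean accuracy. The matrix layout (corruption name mapped to a list of five accuracies), the helper name `collapse_severity`, the 0.5 threshold, and the accuracy values are all assumptions made for illustration. They are not the tool's actual JSON schema or measured results.

```python
from typing import Optional


def collapse_severity(accs: list[float], clean: float, frac: float = 0.5) -> Optional[int]:
    """Return the first severity (1-based) whose accuracy is below frac * clean, else None."""
    for severity, acc in enumerate(accs, start=1):
        if acc < frac * clean:
            return severity
    return None


# Illustrative values only, NOT measured output from the benchmark.
matrix = {
    "blur": [0.98, 0.96, 0.90, 0.35, 0.15],
    "quantize": [0.99, 0.99, 0.98, 0.98, 0.97],
}
clean_acc = 0.988

collapse = {name: collapse_severity(accs, clean_acc) for name, accs in matrix.items()}
print(collapse)
```

With these toy numbers, `blur` is flagged at severity 4 and `quantize` never collapses. The threshold is a judgment call; a stricter or looser `frac` moves the reported collapse point.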

A finding worth the repo

Run compare on two models and you get the point of the whole exercise:

| model | clean acc | robustness gap |
| --- | --- | --- |
| SVC | 0.988 | +0.217 |
| KNN | 0.980 | +0.183 |

The higher-clean-accuracy model (SVC) is the less robust one. A leaderboard that ranks on clean accuracy would pick the more fragile model. That's exactly why robustness needs its own number — and why this benchmark ranks on the gap, not on clean accuracy.
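
The ranking flip can be reproduced with a few lines of Python. This is only a sketch: the numbers are the ones quoted in the comparison above, while the dictionary keys are hypothetical names chosen here, not the `compare` command's output schema.

```python
# Values quoted in the SVC vs. KNN comparison; key names are illustrative.
models = [
    {"model": "SVC", "clean_acc": 0.988, "robustness_gap": 0.217},
    {"model": "KNN", "clean_acc": 0.980, "robustness_gap": 0.183},
]

# A clean-accuracy leaderboard sorts high to low.
by_clean = sorted(models, key=lambda m: m["clean_acc"], reverse=True)
# A robustness leaderboard sorts by gap, smallest (most robust) first.
by_gap = sorted(models, key=lambda m: m["robustness_gap"])

print("ranked by clean accuracy:", [m["model"] for m in by_clean])
print("ranked by robustness gap:", [m["model"] for m in by_gap])
```

The two orderings disagree on the winner, which is the whole argument for reporting the gap separately.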

Quickstart

Needs uv. The verified path is CPU-only, no torch, ~1 second.

uv sync --extra dev
uv run robustness-eval doctor --json
uv run robustness-eval corruptions --json                              # the benchmark's axes
uv run robustness-eval run configs/digits-robustness.yaml --json       # grade a model
uv run robustness-eval compare configs/digits-robustness.yaml configs/digits-knn.yaml --json

Or run it as a tool with no clone:

uvx --from git+https://github.com/RubenHaisma/robustness-eval robustness-eval corruptions --json

How it works

1. Train the classifier on the clean training split.
2. For each (corruption, severity), apply the corruption to a copy of the clean test split and measure accuracy. Corruptions are test-time only and deterministic given the seed, so this measures genuine distribution-shift robustness, not training augmentation, and the whole report reproduces.
3. Aggregate into: clean accuracy, the corruption × severity matrix, mean corrupted accuracy, the robustness gap (clean − mean_corrupted), and relative robustness (mean_corrupted / clean, so models of different clean accuracy compare fairly).
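
The aggregation step is plain arithmetic. With the digits numbers quoted earlier, 0.988 − 0.771 = 0.217 for the gap and 0.771 / 0.988 ≈ 0.78 for relative robustness. The sketch below shows one way to compute these from a matrix. It assumes an unweighted mean over every (corruption, severity) cell, and the function name, key names, and toy matrix are hypothetical; the real logic lives in lib/evaluate.py and may differ.

```python
from statistics import mean


def aggregate(clean_acc: float, matrix: dict[str, list[float]]) -> dict[str, float]:
    """Collapse a corruption x severity accuracy matrix into the headline numbers."""
    # Assumption: every cell counts equally in the mean.
    cells = [acc for accs in matrix.values() for acc in accs]
    mean_corrupted = mean(cells)
    return {
        "clean_accuracy": clean_acc,
        "mean_corrupted_accuracy": mean_corrupted,
        "robustness_gap": clean_acc - mean_corrupted,
        "relative_robustness": mean_corrupted / clean_acc,
    }


# Toy matrix with two corruptions and two severities (not real results).
toy = {"noise": [1.0, 0.5], "blur": [0.75, 0.25]}
summary = aggregate(1.0, toy)
print(summary)
```

For the toy matrix the mean corrupted accuracy is 0.625, so the gap is 0.375 and relative robustness is 0.625.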

The corruption registry is in lib/corrupt.py — channel-agnostic (grayscale or colour), severity-parameterised, seeded. Add a corruption by writing (img, severity, rng) -> img and registering it; the benchmark and the corruptions spec pick it up automatically.
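
As a rough illustration of the `(img, severity, rng) -> img` shape, here is a sketch of a new corruption plus a registration decorator. Everything named here is an assumption: `REGISTRY`, `register`, the `salt_pepper` corruption, its 2%-per-severity schedule, and the use of `numpy.random.Generator` for `rng`. The actual registration mechanism in lib/corrupt.py may look different.

```python
from typing import Callable, Dict

import numpy as np

Corruption = Callable[[np.ndarray, int, np.random.Generator], np.ndarray]

# Hypothetical stand-in for the registry in lib/corrupt.py.
REGISTRY: Dict[str, Corruption] = {}


def register(name: str) -> Callable[[Corruption], Corruption]:
    """Decorator that records a corruption function under a name."""
    def wrap(fn: Corruption) -> Corruption:
        REGISTRY[name] = fn
        return fn
    return wrap


@register("salt_pepper")
def salt_pepper(img: np.ndarray, severity: int, rng: np.random.Generator) -> np.ndarray:
    """Set a severity-dependent fraction of pixels to 0 or 1."""
    frac = 0.02 * severity  # assumed schedule: 2% of pixels per severity level
    out = img.astype(np.float64, copy=True)  # never mutate the caller's image
    mask = rng.random(img.shape) < frac  # works for (H, W) and (H, W, C) alike
    out[mask] = rng.integers(0, 2, size=int(mask.sum()))
    return np.clip(out, 0.0, 1.0)


# Same seed, same result; shape is preserved for grayscale and colour inputs.
gray = np.full((8, 8), 0.5)
colour = np.full((8, 8, 3), 0.5)
print(salt_pepper(gray, 3, np.random.default_rng(0)).shape)
print(REGISTRY["salt_pepper"](colour, 3, np.random.default_rng(0)).shape)
```

The sketch mirrors the properties the README asks of a corruption: it is seeded through `rng`, scales with `severity`, keeps values in `[0,1]`, and does not care how many channels the image has.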

Backends

robustness-eval run <config> reads the config's backend: key and routes to the matching backend; the --json output has the same shape either way.

| backend | classifier | compute | status |
| --- | --- | --- | --- |
| `sklearn` | a small scikit-learn model on `digits` | CPU, in-process | ✅ verified in CI, ~1s |
| `cnn` | a small torch CNN on CIFAR-10 | `compute: modal` (rented GPU) or `compute: local` (your GPU) | 🔌 wired; needs a GPU (not run in CI) |

uv run robustness-eval run configs/cifar-robustness.yaml --dry-run --json   # inspect the GPU plan, no spend
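
Routing on `backend:` can be pictured as a small dispatch table. The sketch below is an assumption about the general pattern, not a copy of lib/backends.py: the runner functions, the returned keys, and the error message are hypothetical, and the runners return empty metrics instead of training anything.

```python
from typing import Callable, Dict


def run_sklearn(config: dict) -> dict:
    """Placeholder for the CPU path; a real runner would train and grade a model."""
    return {"backend": "sklearn", "metrics": {}}


def run_cnn(config: dict) -> dict:
    """Placeholder for the GPU path; returns the same top-level keys as run_sklearn."""
    return {"backend": "cnn", "metrics": {}}


BACKENDS: Dict[str, Callable[[dict], dict]] = {"sklearn": run_sklearn, "cnn": run_cnn}


def dispatch(config: dict) -> dict:
    """Pick a runner from the config's `backend` key, or return an error object."""
    backend = config.get("backend")
    runner = BACKENDS.get(backend)
    if runner is None:
        return {"ok": False, "error": f"unknown backend: {backend!r}"}
    return runner(config)


print(dispatch({"backend": "sklearn"}))
print(dispatch({"backend": "tpu"}))
```

Keeping every runner's return value on the same keys is what lets a caller parse the result without knowing which backend ran.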

What's verified

| Claim | How it's checked | Status |
| --- | --- | --- |
| Corruptions degrade accuracy (positive gap) | `test_corruptions_degrade` (load-bearing invariant) | ✅ CI |
| Higher severity hurts more | `test_higher_severity_hurts_more` | ✅ CI |
| The benchmark is deterministic (seed → identical report) | `make repro` (runs twice, asserts equal) | ✅ CI |
| Every corruption stays in `[0,1]` and preserves shape | `test_all_corruptions_preserve_shape_and_range` | ✅ CI |
| The README quickstart still runs | `make readme` runs the `<!-- ci-test -->` block | ✅ CI |
| Live metrics posted to the run summary | CI runs the benchmark + `ci_report.py` | ✅ CI |
| CNN/CIFAR robustness on a real GPU | `scripts/modal_cnn.py` | 🔌 wired, needs a GPU |

Drive it with Claude Code (or any agent)

Every command is non-interactive, takes --json, and uses load-bearing exit codes. The contract: with --json, stdout is exactly one JSON object (success or {"ok": false, "error": "..."}); exit 0 = success, non-zero = failure with one stderr line. Parse stdout, branch on the exit code. Agent instructions live in AGENTS.md (CLAUDE.md is a symlink to it).

robustness-eval run my-model.yaml --json | jq '.metrics.robustness_gap'   # one number to gate a regression
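
One caveat with the pipe above: in most shells a pipeline's exit status is that of its last command (here jq) unless `pipefail` is set, so the CLI's own exit code can be lost. An agent that wants both the exit code and the parsed JSON can call the command directly. The sketch below shows that pattern under stated assumptions: `run_json` and `gap_within_budget` are hypothetical helper names, the 0.25 budget is arbitrary, and a tiny Python one-liner stands in for `robustness-eval run my-model.yaml --json` so the example runs without the tool installed. Only the `.metrics.robustness_gap` path and the error object shape come from the contract described above.

```python
import json
import subprocess
import sys


def run_json(argv: list[str]) -> dict:
    """Run a command that follows the --json contract and return its parsed stdout."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    payload = json.loads(proc.stdout)  # contract: stdout is exactly one JSON object
    if proc.returncode != 0:
        # contract: non-zero exit means failure, described by the "error" field
        raise RuntimeError(payload.get("error", proc.stderr.strip()))
    return payload


def gap_within_budget(payload: dict, max_gap: float) -> bool:
    """True when the reported robustness gap does not exceed the allowed budget."""
    return payload["metrics"]["robustness_gap"] <= max_gap


# Stand-in for `robustness-eval run my-model.yaml --json` (illustrative payload).
fake_cli = [
    sys.executable,
    "-c",
    "import json; print(json.dumps({'metrics': {'robustness_gap': 0.217}}))",
]

report = run_json(fake_cli)
print("gate passed:", gap_within_budget(report, max_gap=0.25))
```

Swapping `fake_cli` for the real command line is the only change needed to use this as a regression gate, assuming the success payload keeps the gap under `metrics.robustness_gap`.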

Layout

src/robustness_eval/
  cli.py           # Typer app, one command per capability
  output.py        # emit()/fail() — output + exit-code contract
  commands/        # doctor, run, corruptions, compare, version
  lib/
    corrupt.py     # the corruption registry (the benchmark axes)
    evaluate.py    # train clean, grade under corruption → the report
    data.py        # digits train/clean-test split
    backends.py    # sklearn (verified) | cnn (GPU) dispatch
    cnn_runner.py  # the torch/CIFAR engine (gpu extra)
configs/           # digits-robustness.yaml (verified) · cifar-robustness.yaml (GPU)
scripts/           # modal_cnn.py + stdlib CI helpers
notebooks/         # marimo (.py) — heatmap + corruption strength vs. fragility

Why this exists

This is a portfolio piece in computer-vision evaluation that argues a real point: a single accuracy number hides fragility, and the model that wins on clean data can lose under shift. It is built in the same house style as rl-studio and rl-augment; the latter searches for augmentations that help training, which makes it the natural complement to measuring robustness to corruption.

License

Apache-2.0.
