RubenHaisma/robustness-eval

robustness-eval

How much does your image classifier degrade under corruption — and which corruption breaks it first?

A small, referenceable corruption-robustness benchmark in the spirit of ImageNet-C / CIFAR-10-C: train on clean data, grade the same model under a battery of named corruptions (gaussian_noise, blur, rotate, occlusion, pixelate, …) at severities 1-5, and report one headline number — the robustness gap (clean accuracy minus mean corrupted accuracy). Bring your own classifier.

CLI-first, --json on every command, load-bearing exit codes — so a coding agent (Claude Code, Codex, Cursor) can grade a model and parse the full degradation matrix with no UI.

computer-vision · robustness · benchmark · distribution-shift · imagenet-c · cifar-10-c · model-evaluation · cli · agents


What you get

One command produces the standard robustness report. On scikit-learn digits with an SVC (clean accuracy 0.988):

clean 0.988  →  mean corrupted 0.771   robustness gap +0.217   (relative 0.78)

most robust to:   quantize 0.98 · shot_noise 0.98 · gaussian_noise 0.95
most fragile to:  pixelate 0.42 · translate 0.59 · rotate 0.61

The full corruption × severity matrix is in the --json output, so you can see the exact severity at which each corruption collapses the model (e.g. SVC on digits survives blur to severity 3, then falls off a cliff).

A finding worth the repo

Run compare on two models and you get the point of the whole exercise:

| model | clean acc | robustness gap |
|-------|-----------|----------------|
| SVC   | 0.988     | +0.217         |
| KNN   | 0.980     | +0.183         |

The higher-clean-accuracy model (SVC) is the less robust one. A leaderboard that ranks on clean accuracy would pick the more fragile model. That's exactly why robustness needs its own number — and why this benchmark ranks on the gap, not on clean accuracy.
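The disagreement between the two rankings can be checked in a few lines. This is only an illustration using the two rows reported above; the dictionary layout and key names are made up for the example and are not the shape of the tool's compare output.

```python
# Numbers copied from the SVC / KNN comparison above.
# The dict layout is illustrative, not the tool's compare --json schema.
models = {
    "SVC": {"clean_acc": 0.988, "robustness_gap": 0.217},
    "KNN": {"clean_acc": 0.980, "robustness_gap": 0.183},
}

# A clean-accuracy leaderboard picks the highest clean accuracy.
by_clean = max(models, key=lambda name: models[name]["clean_acc"])

# A robustness leaderboard picks the smallest gap (least degradation).
by_gap = min(models, key=lambda name: models[name]["robustness_gap"])

print(f"best on clean accuracy: {by_clean}")
print(f"best on robustness gap: {by_gap}")
```

The two criteria select different models, which is the point the comparison makes.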


Quickstart

Needs uv. The verified path is CPU-only, no torch, ~1 second.

uv sync --extra dev
uv run robustness-eval doctor --json
uv run robustness-eval corruptions --json                              # the benchmark's axes
uv run robustness-eval run configs/digits-robustness.yaml --json       # grade a model
uv run robustness-eval compare configs/digits-robustness.yaml configs/digits-knn.yaml --json

Or run it as a tool with no clone:

uvx --from git+https://github.com/RubenHaisma/robustness-eval robustness-eval corruptions --json

How it works

  1. Train the classifier on the clean training split.
  2. For each (corruption, severity), apply the corruption to a copy of the clean test split and measure accuracy. Corruptions are test-time only and deterministic given the seed — so this measures genuine distribution-shift robustness, not training augmentation, and the whole report reproduces.
  3. Aggregate into: clean accuracy, the corruption × severity matrix, mean corrupted accuracy, the robustness gap (clean − mean_corrupted), and relative robustness (mean_corrupted / clean, so models of different clean accuracy compare fairly).
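The aggregation in step 3 can be sketched as follows. This is a minimal sketch, not the code in evaluate.py: the function name `summarize`, the dictionary keys, and the choice to average every (corruption, severity) cell with equal weight are assumptions. The matrix values are toy numbers, not benchmark results.

```python
from statistics import mean


def summarize(clean_acc: float, matrix: dict[str, dict[int, float]]) -> dict[str, float]:
    """Collapse a corruption x severity accuracy matrix into headline numbers.

    Assumption: every (corruption, severity) cell carries equal weight.
    """
    cells = [acc for by_severity in matrix.values() for acc in by_severity.values()]
    mean_corrupted = mean(cells)
    return {
        "clean_accuracy": clean_acc,
        "mean_corrupted_accuracy": mean_corrupted,
        # Gap: absolute accuracy lost under corruption.
        "robustness_gap": clean_acc - mean_corrupted,
        # Relative robustness: fraction of clean accuracy that survives.
        "relative_robustness": mean_corrupted / clean_acc,
    }


# Toy matrix with two corruptions and two severities (illustrative values only).
toy = {"blur": {1: 0.8, 2: 0.6}, "gaussian_noise": {1: 0.9, 2: 0.5}}
report = summarize(0.9, toy)
print(report)
```

The same two formulas reproduce the digits/SVC headline: 0.988 − 0.771 = 0.217 for the gap and 0.771 / 0.988 ≈ 0.78 for relative robustness.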

The corruption registry is in lib/corrupt.py — channel-agnostic (grayscale or colour), severity-parameterised, seeded. Add a corruption by writing (img, severity, rng) -> img and registering it; the benchmark and the corruptions spec pick it up automatically.
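A new corruption might look like the sketch below. Only the `(img, severity, rng) -> img` signature comes from this README; the `REGISTRY` dict, the `register` decorator, the `salt_pepper` name, the severity schedule, and the use of `numpy.random.Generator` for `rng` are all hypothetical stand-ins. Check lib/corrupt.py for the real registration mechanism.

```python
from typing import Callable, Dict

import numpy as np

Corruption = Callable[[np.ndarray, int, np.random.Generator], np.ndarray]

# Hypothetical stand-in for the registry in lib/corrupt.py.
REGISTRY: Dict[str, Corruption] = {}


def register(name: str) -> Callable[[Corruption], Corruption]:
    """Hypothetical decorator that stores a corruption under a name."""
    def decorator(fn: Corruption) -> Corruption:
        REGISTRY[name] = fn
        return fn
    return decorator


@register("salt_pepper")
def salt_pepper(img: np.ndarray, severity: int, rng: np.random.Generator) -> np.ndarray:
    """Set a severity-dependent fraction of pixels to 0 or 1.

    Expects floats in [0, 1], shaped (H, W) or (H, W, C).
    """
    if not 1 <= severity <= 5:
        raise ValueError("severity must be in 1..5")
    fraction = 0.02 * severity  # assumed schedule: 2% of pixels per severity step
    out = img.astype(np.float64, copy=True)
    # One decision per pixel location, shared across channels if present.
    mask = rng.random(img.shape[:2]) < fraction
    values = rng.integers(0, 2, size=img.shape[:2]).astype(np.float64)
    out[mask] = values[mask][:, None] if out.ndim == 3 else values[mask]
    # Keep the range invariant that the test suite checks.
    return np.clip(out, 0.0, 1.0)


gray = np.full((8, 8), 0.5)
corrupted = REGISTRY["salt_pepper"](gray, 3, np.random.default_rng(0))
print(corrupted.shape, float(corrupted.min()), float(corrupted.max()))
```

Taking the generator as an argument rather than seeding inside the function is what makes the result reproducible for a given seed, matching the determinism the benchmark relies on.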


Backends

robustness-eval run <config> reads the backend: key from the config and dispatches to the matching backend. The --json output has the same shape either way.

| backend | classifier | compute | status |
|---------|------------|---------|--------|
| sklearn | a small scikit-learn model on digits | CPU, in-process | verified in CI, ~1s |
| cnn | a small torch CNN on CIFAR-10 | compute: modal (rented GPU) or compute: local (your GPU) | 🔌 wired; needs a GPU (not run in CI) |

uv run robustness-eval run configs/cifar-robustness.yaml --dry-run --json   # inspect the GPU plan, no spend

What's verified

| Claim | How it's checked | Status |
|-------|------------------|--------|
| Corruptions degrade accuracy (positive gap) | test_corruptions_degrade (load-bearing invariant) | ✅ CI |
| Higher severity hurts more | test_higher_severity_hurts_more | ✅ CI |
| The benchmark is deterministic (seed → identical report) | make repro (runs twice, asserts equal) | ✅ CI |
| Every corruption stays in [0,1] and preserves shape | test_all_corruptions_preserve_shape_and_range | ✅ CI |
| The README quickstart still runs | make readme runs the <!-- ci-test --> block | ✅ CI |
| Live metrics posted to the run summary | CI runs the benchmark + ci_report.py | ✅ CI |
| CNN/CIFAR robustness on a real GPU | scripts/modal_cnn.py | 🔌 wired, needs a GPU |

Drive it with Claude Code (or any agent)

Every command is non-interactive, takes --json, and uses load-bearing exit codes. The contract: with --json, stdout is exactly one JSON object (success or {"ok": false, "error": "..."}); exit 0 = success, non-zero = failure with one stderr line. Parse stdout, branch on the exit code. Agent instructions live in AGENTS.md (CLAUDE.md is a symlink to it).

robustness-eval run my-model.yaml --json | jq '.metrics.robustness_gap'   # one number to gate a regression
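From Python, the same contract can be consumed roughly as below. This is a sketch under the contract stated above (one JSON object on stdout, non-zero exit on failure) and the `.metrics.robustness_gap` path from the jq example. The helper names `parse_result` and `robustness_gap` are hypothetical, and the sketch does not assume a success payload carries an `ok` field.

```python
import json
import subprocess


def parse_result(returncode: int, stdout: str) -> dict:
    """Apply the CLI contract: one JSON object on stdout, exit 0 on success."""
    payload = json.loads(stdout)
    if returncode != 0 or payload.get("ok") is False:
        raise RuntimeError(payload.get("error", f"exit code {returncode}"))
    return payload


def robustness_gap(config_path: str) -> float:
    """Run the benchmark for one config and return the headline gap."""
    proc = subprocess.run(
        ["robustness-eval", "run", config_path, "--json"],
        capture_output=True,
        text=True,
        check=False,  # branch on the exit code ourselves
    )
    payload = parse_result(proc.returncode, proc.stdout)
    return float(payload["metrics"]["robustness_gap"])
```

A regression gate then reduces to comparing `robustness_gap("my-model.yaml")` against a threshold you choose and failing the job when the gap exceeds it.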

Layout

src/robustness_eval/
  cli.py           # Typer app, one command per capability
  output.py        # emit()/fail() — output + exit-code contract
  commands/        # doctor, run, corruptions, compare, version
  lib/
    corrupt.py     # the corruption registry (the benchmark axes)
    evaluate.py    # train clean, grade under corruption → the report
    data.py        # digits train/clean-test split
    backends.py    # sklearn (verified) | cnn (GPU) dispatch
    cnn_runner.py  # the torch/CIFAR engine (gpu extra)
configs/           # digits-robustness.yaml (verified) · cifar-robustness.yaml (GPU)
scripts/           # modal_cnn.py + stdlib CI helpers
notebooks/         # marimo (.py) — heatmap + corruption strength vs. fragility

Why this exists

A portfolio piece in computer-vision evaluation that argues a real point: a single accuracy number hides fragility, and the model that wins on clean data can lose under shift. Built in the same house style as rl-augment (which searches augmentations to help training — the natural complement to measuring robustness to corruption) and rl-studio.

License

Apache-2.0.
