From 9819ac577929186635136588f0a64093d289d609 Mon Sep 17 00:00:00 2001 From: AJ Barea Date: Mon, 1 Jun 2026 07:18:29 -0400 Subject: [PATCH 1/2] =?UTF-8?q?feat(reproduce):=20velocity=20reproduce=20?= =?UTF-8?q?=E2=80=94=20re-run=20an=20archived=20sweep?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The inverse of `velocity archive`, closing the reproducibility loop. read_archive recovers the per-run RunSpecs (and original comparison.json) straight from the crate zip; reproduce_archive re-runs them via the existing run_sweep (DRY). `velocity reproduce [--out] [--check] [--tolerance]`. --check compares each run's reproduced final loss against the archived value within a relative tolerance and exits non-zero on a real mismatch. Tolerance-based, not bit-exact (ML/float aggregation isn't bitwise reproducible across runs/hardware) and nan-safe (pydantic serializes an in-memory nan loss to JSON null, so a present run with null loss is read as nan and doesn't false-mismatch a reproduced nan). Documented in docs/cli.md (roster guard). TDD throughout; full suite 379 passed. research(2026-06): ACM/NISO "reproduced" = same config+code re-executed; numerical non-determinism in ML -> tolerance not equality. --- IMPL.md | 48 +++++++++-------- ROADMAP.md | 20 +++++--- docs/cli.md | 23 +++++++++ python/velocity/archive.py | 102 ++++++++++++++++++++++++++++++++++++- python/velocity/cli.py | 45 ++++++++++++++++ tests/test_archive.py | 91 +++++++++++++++++++++++++++++++++ 6 files changed, 297 insertions(+), 32 deletions(-) diff --git a/IMPL.md b/IMPL.md index e1d3896..b412ade 100644 --- a/IMPL.md +++ b/IMPL.md @@ -1,34 +1,32 @@ -# IMPL: reproducibility archive shipped — `velocity reproduce` next +# IMPL: reproducibility loop complete — nothing in flight Session-by-session checklist for what's actively in flight. Long-horizon planning lives in [ROADMAP.md](ROADMAP.md). 
## In flight -_Nothing open._ The **`velocity archive`** reproducibility-archive generator -shipped 2026-06-01 (ROADMAP → Completed): it packages a `velocity sweep` output -directory into a single-file **RO-Crate** (Process Run Crate profile, `.zip`) — -the sweep artifacts plus a `uv.lock` snapshot (fallback `installed-packages.txt`), -a how-to-reproduce `README.md`, and a hand-rolled, spec-conformant -`ro-crate-metadata.json` — with **zero new dependency**. New importable -`velocity.archive` module; `velocity archive [-o] [--lockfile]` CLI, -reusing the sweep-time `capture_manifest()` provenance (DRY). - -The web-search at the format decision point paid off: the ROADMAP had guessed a -hand-rolled `.tar.gz`, but RO-Crate (Process Run Crate) is the 2026 standard for -packaging a run with machine-readable provenance, and the spec lets us emit the -JSON-LD by hand (no `rocrate` dep). Also corrected the ROADMAP's inaccurate -`velocity run --save-reproducible-archive` host — `velocity run` is -stateless/seedless, so the archive operates on `velocity sweep` output instead. - -## Next pickup - -- **`velocity reproduce `** (ROADMAP → Audit-of-audit follow-ups) — - the inverse of `archive`: unpack a crate, recover the per-run `RunSpec`s from - the bundled `config.json`s, re-run via `run_sweep`, diff against the bundled - results. The archive's README already documents the manual round-trip (this - session verified it executes); this automates it. Offline-testable via the - demo/stub server — no GPU/HF. +_Nothing open._ The reproducibility loop is complete (both shipped 2026-06-01, +ROADMAP → Completed): + +- **`velocity archive `** — package a sweep output into a single-file + RO-Crate (Process Run Crate, `.zip`): the sweep artifacts plus a `uv.lock` + snapshot, a how-to-reproduce README, and a spec-conformant + `ro-crate-metadata.json`. Zero new dependency (hand-rolled JSON-LD). 
+- **`velocity reproduce [--check]`** — the inverse: recover the + per-run `RunSpec`s from the crate, re-run via `run_sweep`, and (with `--check`) + verify each run's final loss against the archived value within a relative + tolerance (not bit-exact; nan-safe). Exits non-zero on a real mismatch. + +Both are CLI-only (no MCP-contract churn), built on the existing `sweep` +machinery (DRY), and live in `velocity.archive`. + +## Next up (queued, not active) + +See ROADMAP. The clean CLI-only runway around the sweep/archive area is tapped; +what remains needs design or research judgment rather than execution — the +leaderboard read-side items (cross-config normalisation, the LLM-backed A2A +auditors: convergence / robustness / hyperparameter) and the research-tier +streams (server-side DP, streaming aggregation, compression). When picking one up, replace this file with a full session plan (Why / Decisions / Scope / Out of scope / Definition of done). diff --git a/ROADMAP.md b/ROADMAP.md index 1befa17..2fc3e14 100644 --- a/ROADMAP.md +++ b/ROADMAP.md @@ -407,12 +407,6 @@ rather than the curated, dumped arena CSV the first cut renders. > Source: 2026-05-21 audit-of-audits review (deleted after extraction). Items that survived verdict review but don't fit Compression / Privacy / Streaming cleanly. -- **`velocity reproduce `** — the inverse of the now-shipped - `velocity archive` (see Completed): unpack a reproducibility crate, recover the - per-run `RunSpec`s from the bundled `config.json`s, re-run via `run_sweep`, and - diff against the bundled results. The archive's README already documents the - manual round-trip (verified working end-to-end); this automates it. Cheaply - testable — the demo/stub server runs offline, no GPU/HF. Tier 1. - **Cross-silo Pareto benchmark suite** — power-law (Pareto: 20% clients hold 80% data) realistic distribution as a benchmark axis alongside the existing IID + Dirichlet partitioners. 
Measure convergence + per-client accuracy variance + robustness-under-attack on the same skew. Real FL deployments are cross-silo, not equal-sized; benchmarks should reflect that. Fits under `## Performance`. Tier 3 research. ## Cross-sister polish (2026-05-21) @@ -447,6 +441,20 @@ a dash is illegal. Only display/brand prose is "Velocity-FL". Authoritative records: git history, `docs/benchmarks.md`, `docs/convergence.md`, `docs/strategies.md`. This index is pruned once work is durably shipped. +- 2026-06-01 — **`velocity reproduce` — re-run an archived sweep (closes the loop).** + The inverse of `velocity archive`: `read_archive` recovers the per-run `RunSpec`s + (and the original `comparison.json`) straight from the crate zip; `reproduce_archive` + re-runs them through the existing `run_sweep` (DRY). `velocity reproduce + [--out] [--check] [--tolerance]`; `--check` compares each run's reproduced final loss + against the archived value within a relative tolerance and exits non-zero on a real + mismatch. Tolerance-based, **not** bit-exact — ML / float aggregation isn't bitwise + reproducible across runs/hardware — and nan-safe (pydantic serializes an in-memory nan + loss to JSON null, so a present run with null loss is read as nan and doesn't + false-mismatch a reproduced nan; an absent run is a real mismatch). TDD (spec recovery + + end-to-end re-run on the offline stub + tolerance/nan logic + CLI); full suite + lint + green; documented in docs/cli.md (roster guard). research(2026-06): ACM/NISO "reproduced" + = same config + code re-executed (acm.org/publications/badging-terms); numerical + non-determinism in ML → tolerance not equality (arXiv:2506.09501, arXiv:2302.12691). 
- 2026-06-01 — **Reproducibility archive generator (`velocity archive`).** Packages a sweep output directory into a single-file **RO-Crate** (Process Run Crate profile, `.zip`): the existing sweep artifacts (`manifest.json`, `comparison.{json,md}`, diff --git a/docs/cli.md b/docs/cli.md index c82b4f3..e8172c9 100644 --- a/docs/cli.md +++ b/docs/cli.md @@ -173,6 +173,29 @@ uv run velocity archive out/20260601T100000Z-sweep -o paper-artifact.zip --- +## `velocity reproduce` + +Re-runs an archived sweep — a *reproduction* (same config and code, re-executed) of a bundle produced by [`velocity archive`](#velocity-archive). Optionally verifies the re-run matches the archived results. + +```bash +# Re-run an archive into out/-reproduce +uv run velocity reproduce paper-artifact.zip + +# Re-run and verify results match the archived ones, within a relative tolerance +uv run velocity reproduce paper-artifact.zip --check --tolerance 1e-4 +``` + +| Argument / Option | Type | Default | Description | +|---|---|---|---| +| `ARCHIVE` (positional) | `path` | none | A reproducibility archive `.zip` from `velocity archive`. | +| `--out` | `path` | `out/-reproduce` | Output directory for the re-run. | +| `--check` | flag | off | Compare reproduced per-run final loss against the archived values. | +| `--tolerance` | `float` | `1e-6` | Relative tolerance for `--check`. Not bit-exact — float aggregation isn't bitwise reproducible across runs/hardware. | + +**Output** — a fresh sweep output directory. With `--check`, a per-run `ok` / `MISMATCH` report that exits non-zero if any run's result diverges beyond tolerance. 
+ +--- + ## Exit codes | Code | Meaning | diff --git a/python/velocity/archive.py b/python/velocity/archive.py index 4e1bfc4..a0be493 100644 --- a/python/velocity/archive.py +++ b/python/velocity/archive.py @@ -22,10 +22,15 @@ import importlib.metadata import json +import math import zipfile +from dataclasses import dataclass from datetime import UTC, datetime from pathlib import Path -from typing import Any +from typing import TYPE_CHECKING, Any + +if TYPE_CHECKING: + from velocity.sweep import RunSpec, SweepResult RO_CRATE_CONTEXT = "https://w3id.org/ro/crate/1.1/context" RO_CRATE_PROFILE = "https://w3id.org/ro/crate/1.1" @@ -250,3 +255,98 @@ def create_archive( zf.writestr("README.md", readme) zf.writestr("ro-crate-metadata.json", json.dumps(metadata, indent=2)) return archive_path + + +@dataclass +class ArchiveContents: + """What `read_archive` recovers from a reproducibility crate.""" + + specs: list[RunSpec] + original: dict[str, Any] | None # parsed comparison.json (a SweepResult dump), if present + + +def read_archive(archive_path: Path) -> ArchiveContents: + """Recover the per-run `RunSpec`s (and original results) from a crate zip. + + Reads the bundled `/config.json` entries and `comparison.json` directly + from the archive — no extraction to disk needed to inspect or re-run it. 
+ """ + from velocity.sweep import RunSpec + + archive_path = Path(archive_path) + specs: list[RunSpec] = [] + original: dict[str, Any] | None = None + with zipfile.ZipFile(archive_path) as zf: + names = set(zf.namelist()) + for name in sorted(n for n in names if n.endswith("/config.json")): + specs.append(RunSpec.model_validate_json(zf.read(name))) + if "comparison.json" in names: + original = json.loads(zf.read("comparison.json")) + if not specs: + raise ValueError(f"{archive_path} has no /config.json — not a velocity archive") + return ArchiveContents(specs=specs, original=original) + + +def reproduce_archive(archive_path: Path, *, out_dir: Path) -> SweepResult: + """Re-run an archived sweep from its bundled configs into ``out_dir``. + + A reproduction in the ACM/NISO sense: same configs + code, re-executed. Reuses + the existing ``run_sweep`` runner (DRY) over the recovered specs. + """ + from velocity.sweep import run_sweep + + contents = read_archive(archive_path) + return run_sweep(contents.specs, out_dir=Path(out_dir)) + + +@dataclass +class ResultDiff: + """One run's archived vs reproduced final loss, and whether they agree.""" + + name: str + original: float | None + reproduced: float + ok: bool + + +def _loss_close(a: float | None, b: float, rel_tol: float) -> bool: + if a is None: + return False # run absent from the original results + a_nan = isinstance(a, float) and a != a + b_nan = isinstance(b, float) and b != b + if a_nan and b_nan: + return True # both undefined → not a mismatch (the offline stub's losses are nan) + if a_nan or b_nan: + return False + return math.isclose(a, b, rel_tol=rel_tol, abs_tol=0.0) + + +def compare_results( + original: dict[str, Any], reproduced: SweepResult, *, rel_tol: float = 1e-6 +) -> list[ResultDiff]: + """Per-run final-loss agreement within a relative tolerance (nan-safe). 
+ + Tolerance-based, not bit-exact: ML / float aggregation is not bitwise + deterministic across runs and hardware, so asserting equality would emit false + failures. ``original`` is a parsed ``comparison.json`` (a ``SweepResult`` dump). + """ + orig_runs = {r["spec"]["name"]: r for r in original.get("runs", [])} + diffs: list[ResultDiff] = [] + for run in reproduced.runs: + name = run.spec.name + if name in orig_runs: + raw = orig_runs[name].get("final_loss") + # pydantic serializes an in-memory nan loss to JSON null, so a present + # run with null final_loss means nan, not "missing". + original_loss = float("nan") if raw is None else raw + else: + original_loss = None # run absent from the archived results + diffs.append( + ResultDiff( + name=name, + original=original_loss, + reproduced=run.final_loss, + ok=_loss_close(original_loss, run.final_loss, rel_tol), + ) + ) + return diffs diff --git a/python/velocity/cli.py b/python/velocity/cli.py index 64b3279..5d392f1 100644 --- a/python/velocity/cli.py +++ b/python/velocity/cli.py @@ -466,3 +466,48 @@ def archive( except ValueError as e: raise typer.BadParameter(str(e)) from e typer.echo(f"Wrote {written}") + + +@app.command() +def reproduce( + archive: Path = typer.Argument( # noqa: B008 + ..., help="A reproducibility archive (.zip produced by `velocity archive`)." + ), + out: Path | None = typer.Option( # noqa: B008 + None, "--out", help="Output dir for the re-run (default: out/-reproduce)." + ), + check: bool = typer.Option( + False, "--check", help="Compare the re-run's results against the archived ones." + ), + tolerance: float = typer.Option( + 1e-6, "--tolerance", help="Relative tolerance for --check (not bit-exact)." + ), +) -> None: + """Re-run an archived sweep — a reproduction (same config + code, re-executed). 
+ + With ``--check``, compares each run's reproduced final loss against the + archived value within ``--tolerance`` (tolerance-based, since float + aggregation isn't bitwise reproducible) and exits non-zero on any mismatch. + """ + from velocity.archive import compare_results, read_archive, reproduce_archive + + if out is None: + ts = datetime.now(UTC).strftime("%Y%m%dT%H%M%SZ") + out = Path("out") / f"{ts}-reproduce" + try: + result = reproduce_archive(archive, out_dir=out) + except (ValueError, FileNotFoundError) as e: + raise typer.BadParameter(str(e)) from e + typer.echo(f"Reproduced {len(result.runs)} run(s) into {out}") + + if check: + original = read_archive(archive).original + if original is None: + typer.echo("No archived results to check against (comparison.json absent).") + return + diffs = compare_results(original, result, rel_tol=tolerance) + for d in diffs: + mark = "ok" if d.ok else "MISMATCH" + typer.echo(f" [{mark}] {d.name}: archived={d.original} reproduced={d.reproduced}") + if any(not d.ok for d in diffs): + raise typer.Exit(code=1) diff --git a/tests/test_archive.py b/tests/test_archive.py index edcfe91..be9411f 100644 --- a/tests/test_archive.py +++ b/tests/test_archive.py @@ -13,7 +13,10 @@ RO_CRATE_CONTEXT, RO_CRATE_PROFILE, build_ro_crate_metadata, + compare_results, create_archive, + read_archive, + reproduce_archive, ) from velocity.cli import app @@ -150,3 +153,91 @@ def test_cli_archive_command_writes_zip(tmp_path): assert str(dest) in result.output with zipfile.ZipFile(dest) as zf: assert "ro-crate-metadata.json" in zf.namelist() + + +def _real_sweep(tmp_path: Path, seed: int = 7) -> Path: + """A real (offline-stub) sweep dir with valid RunSpec config.json files.""" + from velocity.strategy import FedAvg + from velocity.sweep import RunSpec, run_sweep + + out = tmp_path / "orig-sweep" + run_sweep( + [RunSpec(name="fedavg-baseline", strategy=FedAvg(), rounds=1, min_clients=1, seed=seed)], + out_dir=out, + ) + return out + + +def 
test_read_archive_recovers_specs(tmp_path): + archive = create_archive( + _real_sweep(tmp_path, seed=7), archive_path=tmp_path / "a.zip", lockfile=tmp_path / "none" + ) + contents = read_archive(archive) + assert [s.name for s in contents.specs] == ["fedavg-baseline"] + assert contents.specs[0].seed == 7 + assert type(contents.specs[0].strategy).__name__ == "FedAvg" + assert contents.original is not None # comparison.json recovered for --check + + +def test_reproduce_archive_reruns(tmp_path): + archive = create_archive( + _real_sweep(tmp_path, seed=7), archive_path=tmp_path / "a.zip", lockfile=tmp_path / "none" + ) + result = reproduce_archive(archive, out_dir=tmp_path / "repro") + assert {r.spec.name for r in result.runs} == {"fedavg-baseline"} + assert (tmp_path / "repro" / "comparison.json").exists() # fresh sweep output written + + +def test_compare_results_tolerance_and_nan(): + from velocity.strategy import FedAvg + from velocity.sweep import RunResult, RunSpec, SweepResult + + def mk(name: str, loss: float) -> RunResult: + return RunResult( + spec=RunSpec(name=name, strategy=FedAvg()), + rounds=[], + final_loss=loss, + mean_loss=loss, + elapsed_seconds=0.0, + ) + + reproduced = SweepResult( + runs=[mk("a", 0.5000001), mk("b", 9.9), mk("c", float("nan"))], + total_elapsed=1.0, + serial_elapsed=1.0, + parallel=1, + out_dir="x", + ) + original = { + "runs": [ + {"spec": {"name": "a"}, "final_loss": 0.5}, + {"spec": {"name": "b"}, "final_loss": 1.0}, + {"spec": {"name": "c"}, "final_loss": float("nan")}, + ] + } + diffs = {d.name: d for d in compare_results(original, reproduced, rel_tol=1e-3)} + assert diffs["a"].ok # 0.5000001 vs 0.5 within 1e-3 + assert not diffs["b"].ok # 9.9 vs 1.0 — clear mismatch + assert diffs["c"].ok # both nan → undefined, not a mismatch (stub case) + + +def test_cli_reproduce_command(tmp_path): + archive = create_archive( + _real_sweep(tmp_path, seed=3), archive_path=tmp_path / "a.zip", lockfile=tmp_path / "none" + ) + out = tmp_path / 
"repro-out" + result = CliRunner().invoke(app, ["reproduce", str(archive), "--out", str(out)]) + assert result.exit_code == 0, result.output + assert (out / "comparison.json").exists() + + +def test_cli_reproduce_check_reports(tmp_path): + archive = create_archive( + _real_sweep(tmp_path, seed=3), archive_path=tmp_path / "a.zip", lockfile=tmp_path / "none" + ) + result = CliRunner().invoke( + app, ["reproduce", str(archive), "--out", str(tmp_path / "r"), "--check"] + ) + # Stub losses are nan (both-nan = ok), so --check passes and prints a report. + assert result.exit_code == 0, result.output + assert "fedavg-baseline" in result.output From 45b9d0096505b2e03dc3ea4d7a4729842e7a1d44 Mon Sep 17 00:00:00 2001 From: AJ Barea Date: Mon, 1 Jun 2026 07:21:17 -0400 Subject: [PATCH 2/2] docs(readme): list archive + reproduce in CLI reference + quickstart --- README.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/README.md b/README.md index 703f2b3..beac35c 100644 --- a/README.md +++ b/README.md @@ -124,6 +124,8 @@ velocity run --model-id test/model --dataset test/dataset --rounds 1 --min-clien velocity simulate-attack model_poisoning --intensity 0.2 velocity sweep --strategies FedAvg,Krum --attacks gaussian_noise --rounds 5 velocity leaderboard --metric robustness +velocity archive out/-sweep -o run.crate.zip # bundle a sweep into a reproducibility archive +velocity reproduce run.crate.zip --check # re-run it elsewhere and verify results match ``` --- @@ -136,6 +138,8 @@ velocity leaderboard --metric robustness - `velocity simulate-attack ...` — register one attack and run a round - `velocity sweep ...` — run a strategy × attack matrix across seeds (see [`docs/sweep-spec.md`](docs/sweep-spec.md)) - `velocity leaderboard ...` — rank stored runs (accuracy / rounds-to-target / wall-clock / comm-cost / pareto / pareto-slices / robustness) +- `velocity archive ...` — package a sweep output into a single-file reproducibility archive (RO-Crate) +- `velocity reproduce ...` 
— re-run an archived sweep (`--check` verifies results within tolerance) Full reference: [`docs/cli.md`](docs/cli.md)
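
The `--check` comparison this series describes (relative tolerance, not bit-exact, nan-safe) can be illustrated with a small standalone sketch. It assumes only the standard library; `loss_close` and `archived_loss` are hypothetical names chosen for illustration, loosely mirroring the private helper in the patch rather than reproducing the `velocity.archive` API, and the JSON payload is made-up sample data.

```python
from __future__ import annotations

import json
import math


def loss_close(archived: float | None, reproduced: float, rel_tol: float = 1e-6) -> bool:
    """Hypothetical stand-in for the tolerance check described in the patch."""
    if archived is None:
        return False  # run absent from the archived results: a real mismatch
    a_nan, b_nan = math.isnan(archived), math.isnan(reproduced)
    if a_nan and b_nan:
        return True  # both undefined: not treated as a mismatch
    if a_nan or b_nan:
        return False  # only one side undefined: mismatch
    # Purely relative comparison; abs_tol=0.0 matches the patch's choice.
    return math.isclose(archived, reproduced, rel_tol=rel_tol, abs_tol=0.0)


def archived_loss(run: dict | None) -> float | None:
    """Read a run entry from a parsed comparison.json; JSON null is read as nan."""
    if run is None:
        return None  # the run is missing entirely
    raw = run.get("final_loss")
    return float("nan") if raw is None else raw


# Illustrative sample data only, not real results.
original = json.loads(
    '{"runs": [{"spec": {"name": "a"}, "final_loss": 0.5},'
    ' {"spec": {"name": "c"}, "final_loss": null}]}'
)
by_name = {r["spec"]["name"]: r for r in original["runs"]}

for name, reproduced in [("a", 0.5000001), ("c", float("nan")), ("missing", 0.5)]:
    ok = loss_close(archived_loss(by_name.get(name)), reproduced, rel_tol=1e-3)
    print(f"[{'ok' if ok else 'MISMATCH'}] {name}")
```

The distinction between a null loss and an absent run is the part worth noticing: a null is mapped to nan before comparison, so two undefined losses agree, while a run missing from the archived results always fails the check.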