From 9819ac577929186635136588f0a64093d289d609 Mon Sep 17 00:00:00 2001
From: AJ Barea <ajbareaa@gmail.com>
Date: Mon, 1 Jun 2026 07:18:29 -0400
Subject: [PATCH 1/2] =?UTF-8?q?feat(reproduce):=20velocity=20reproduce=20?=
 =?UTF-8?q?=E2=80=94=20re-run=20an=20archived=20sweep?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The inverse of `velocity archive`, closing the reproducibility loop.
read_archive recovers the per-run RunSpecs (and original comparison.json)
straight from the crate zip; reproduce_archive re-runs them via the existing
run_sweep (DRY).

`velocity reproduce <archive.zip> [--out] [--check] [--tolerance]`. --check
compares each run's reproduced final loss against the archived value within a
relative tolerance and exits non-zero on a real mismatch. The check is
tolerance-based, not bit-exact, because ML/float aggregation isn't bitwise
reproducible across runs/hardware. It is also nan-safe: pydantic serializes an
in-memory nan loss to JSON null, so a run present with a null loss is read back
as nan and matches a reproduced nan. A run absent from the archived results is
a real mismatch.

Documented in docs/cli.md (roster guard). TDD throughout; full suite 379 passed.

research(2026-06): ACM/NISO "reproduced" = same config+code re-executed;
numerical non-determinism in ML -> tolerance not equality.
---
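Reviewer note: the `--check` rule described in the commit message can be
sketched in isolation. This is a minimal standalone illustration, not the
code in this patch: `archived_loss` and `loss_matches` are hypothetical names,
the JSON document is hand-written, and the sketch assumes (as the message
states) that an archived nan loss appears as JSON null.

```python
from __future__ import annotations

import json
import math


def archived_loss(run: dict | None) -> float | None:
    """Map an archived run entry to a loss; None means the run is absent."""
    if run is None:
        return None
    raw = run.get("final_loss")
    # Assumption taken from the commit message: nan was dumped as JSON null.
    return float("nan") if raw is None else float(raw)


def loss_matches(archived: float | None, reproduced: float, rel_tol: float = 1e-6) -> bool:
    """Relative-tolerance comparison that treats nan vs nan as agreement."""
    if archived is None:
        return False  # nothing archived for this run: a real mismatch
    if math.isnan(archived) or math.isnan(reproduced):
        # Agreement only when both sides are undefined.
        return math.isnan(archived) and math.isnan(reproduced)
    return math.isclose(archived, reproduced, rel_tol=rel_tol, abs_tol=0.0)


# Hand-written stand-in for a parsed comparison.json (illustrative values only).
doc = json.loads(
    '{"runs": [{"spec": {"name": "a"}, "final_loss": 0.5},'
    ' {"spec": {"name": "c"}, "final_loss": null}]}'
)
by_name = {r["spec"]["name"]: r for r in doc["runs"]}

print(loss_matches(archived_loss(by_name.get("a")), 0.5000001, rel_tol=1e-3))  # True
print(loss_matches(archived_loss(by_name.get("c")), float("nan")))  # True
print(loss_matches(archived_loss(by_name.get("b")), 0.5))  # False (run "b" absent)
```

One consequence of a purely relative test (`abs_tol=0.0`): an archived loss of
exactly 0.0 only matches a reproduced 0.0, whatever the tolerance.
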
 IMPL.md                    |  48 +++++++++--------
 ROADMAP.md                 |  20 +++++---
 docs/cli.md                |  23 +++++++++
 python/velocity/archive.py | 102 ++++++++++++++++++++++++++++++++++++-
 python/velocity/cli.py     |  45 ++++++++++++++++
 tests/test_archive.py      |  91 +++++++++++++++++++++++++++++++++
 6 files changed, 297 insertions(+), 32 deletions(-)
diff --git a/IMPL.md b/IMPL.md
index e1d3896..b412ade 100644
--- a/IMPL.md
+++ b/IMPL.md
@@ -1,34 +1,32 @@
-# IMPL: reproducibility archive shipped — `velocity reproduce` next
+# IMPL: reproducibility loop complete — nothing in flight
 
 Session-by-session checklist for what's actively in flight. Long-horizon
 planning lives in [ROADMAP.md](ROADMAP.md).
 
 ## In flight
 
-_Nothing open._ The **`velocity archive`** reproducibility-archive generator
-shipped 2026-06-01 (ROADMAP → Completed): it packages a `velocity sweep` output
-directory into a single-file **RO-Crate** (Process Run Crate profile, `.zip`) —
-the sweep artifacts plus a `uv.lock` snapshot (fallback `installed-packages.txt`),
-a how-to-reproduce `README.md`, and a hand-rolled, spec-conformant
-`ro-crate-metadata.json` — with **zero new dependency**. New importable
-`velocity.archive` module; `velocity archive <out-dir> [-o] [--lockfile]` CLI,
-reusing the sweep-time `capture_manifest()` provenance (DRY).
-
-The web-search at the format decision point paid off: the ROADMAP had guessed a
-hand-rolled `.tar.gz`, but RO-Crate (Process Run Crate) is the 2026 standard for
-packaging a run with machine-readable provenance, and the spec lets us emit the
-JSON-LD by hand (no `rocrate` dep). Also corrected the ROADMAP's inaccurate
-`velocity run --save-reproducible-archive` host — `velocity run` is
-stateless/seedless, so the archive operates on `velocity sweep` output instead.
-
-## Next pickup
-
-- **`velocity reproduce <archive.zip>`** (ROADMAP → Audit-of-audit follow-ups) —
-  the inverse of `archive`: unpack a crate, recover the per-run `RunSpec`s from
-  the bundled `config.json`s, re-run via `run_sweep`, diff against the bundled
-  results. The archive's README already documents the manual round-trip (this
-  session verified it executes); this automates it. Offline-testable via the
-  demo/stub server — no GPU/HF.
+_Nothing open._ The reproducibility loop is complete (both shipped 2026-06-01,
+ROADMAP → Completed):
+
+- **`velocity archive <out-dir>`** — package a sweep output into a single-file
+  RO-Crate (Process Run Crate, `.zip`): the sweep artifacts plus a `uv.lock`
+  snapshot, a how-to-reproduce README, and a spec-conformant
+  `ro-crate-metadata.json`. Zero new dependency (hand-rolled JSON-LD).
+- **`velocity reproduce <archive.zip> [--check]`** — the inverse: recover the
+  per-run `RunSpec`s from the crate, re-run via `run_sweep`, and (with `--check`)
+  verify each run's final loss against the archived value within a relative
+  tolerance (not bit-exact; nan-safe). Exits non-zero on a real mismatch.
+
+Both are CLI-only (no MCP-contract churn), built on the existing `sweep`
+machinery (DRY), and live in `velocity.archive`.
+
+## Next up (queued, not active)
+
+See ROADMAP. The clean CLI-only runway around the sweep/archive area is exhausted;
+what remains needs design or research judgment rather than execution — the
+leaderboard read-side items (cross-config normalisation, the LLM-backed A2A
+auditors: convergence / robustness / hyperparameter) and the research-tier
+streams (server-side DP, streaming aggregation, compression).
 
 When picking one up, replace this file with a full session plan
 (Why / Decisions / Scope / Out of scope / Definition of done).
diff --git a/ROADMAP.md b/ROADMAP.md
index 1befa17..2fc3e14 100644
--- a/ROADMAP.md
+++ b/ROADMAP.md
@@ -407,12 +407,6 @@ rather than the curated, dumped arena CSV the first cut renders.
 
 > Source: 2026-05-21 audit-of-audits review (deleted after extraction). Items that survived verdict review but don't fit Compression / Privacy / Streaming cleanly.
 
-- **`velocity reproduce <archive.zip>`** — the inverse of the now-shipped
-  `velocity archive` (see Completed): unpack a reproducibility crate, recover the
-  per-run `RunSpec`s from the bundled `config.json`s, re-run via `run_sweep`, and
-  diff against the bundled results. The archive's README already documents the
-  manual round-trip (verified working end-to-end); this automates it. Cheaply
-  testable — the demo/stub server runs offline, no GPU/HF. Tier 1.
 - **Cross-silo Pareto benchmark suite** — power-law (Pareto: 20% clients hold 80% data) realistic distribution as a benchmark axis alongside the existing IID + Dirichlet partitioners. Measure convergence + per-client accuracy variance + robustness-under-attack on the same skew. Real FL deployments are cross-silo, not equal-sized; benchmarks should reflect that. Fits under `## Performance`. Tier 3 research.
 
 ## Cross-sister polish (2026-05-21)
@@ -447,6 +441,20 @@ a dash is illegal. Only display/brand prose is "Velocity-FL".
 
 Authoritative records: git history, `docs/benchmarks.md`, `docs/convergence.md`, `docs/strategies.md`. This index is pruned once work is durably shipped.
 
+- 2026-06-01 — **`velocity reproduce` — re-run an archived sweep (closes the loop).**
+  The inverse of `velocity archive`: `read_archive` recovers the per-run `RunSpec`s
+  (and the original `comparison.json`) straight from the crate zip; `reproduce_archive`
+  re-runs them through the existing `run_sweep` (DRY). `velocity reproduce <archive.zip>
+  [--out] [--check] [--tolerance]`; `--check` compares each run's reproduced final loss
+  against the archived value within a relative tolerance and exits non-zero on a real
+  mismatch. Tolerance-based, **not** bit-exact — ML / float aggregation isn't bitwise
+  reproducible across runs/hardware — and nan-safe (pydantic serializes an in-memory nan
+  loss to JSON null, so a run present with a null loss is read back as nan and matches a
+  reproduced nan; a run absent from the archive is a real mismatch). TDD (spec recovery
+  + end-to-end re-run on the offline stub + tolerance/nan logic + CLI); full suite + lint
+  green; documented in docs/cli.md (roster guard). research(2026-06): ACM/NISO "reproduced"
+  = same config + code re-executed (acm.org/publications/badging-terms); numerical
+  non-determinism in ML → tolerance not equality (arXiv:2506.09501, arXiv:2302.12691).
 - 2026-06-01 — **Reproducibility archive generator (`velocity archive`).** Packages a
   sweep output directory into a single-file **RO-Crate** (Process Run Crate profile,
   `.zip`): the existing sweep artifacts (`manifest.json`, `comparison.{json,md}`,
diff --git a/docs/cli.md b/docs/cli.md
index c82b4f3..e8172c9 100644
--- a/docs/cli.md
+++ b/docs/cli.md
@@ -173,6 +173,29 @@ uv run velocity archive out/20260601T100000Z-sweep -o paper-artifact.zip
 
 ---
 
+## `velocity reproduce`
+
+Re-runs an archived sweep — a *reproduction* (same config and code, re-executed) of a bundle produced by [`velocity archive`](#velocity-archive). Optionally verifies the re-run matches the archived results.
+
+```bash
+# Re-run an archive into out/<ts>-reproduce
+uv run velocity reproduce paper-artifact.zip
+
+# Re-run and verify results match the archived ones, within a relative tolerance
+uv run velocity reproduce paper-artifact.zip --check --tolerance 1e-4
+```
+
+| Argument / Option | Type | Default | Description |
+|---|---|---|---|
+| `ARCHIVE` (positional) | `path` | none | A reproducibility archive `.zip` from `velocity archive`. |
+| `--out` | `path` | `out/<timestamp>-reproduce` | Output directory for the re-run. |
+| `--check` | flag | off | Compare reproduced per-run final loss against the archived values. |
+| `--tolerance` | `float` | `1e-6` | Relative tolerance for `--check`. Not bit-exact — float aggregation isn't bitwise reproducible across runs/hardware. |
+
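+**Output** — a fresh sweep output directory. With `--check`, the command also prints a per-run `ok` / `MISMATCH` report and exits non-zero if any run's final loss diverges beyond the tolerance; if the archive has no `comparison.json`, it reports that there is nothing to check against and exits zero.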
+
+---
+
 ## Exit codes
 
 | Code | Meaning |
diff --git a/python/velocity/archive.py b/python/velocity/archive.py
index 4e1bfc4..a0be493 100644
--- a/python/velocity/archive.py
+++ b/python/velocity/archive.py
@@ -22,10 +22,15 @@
 
 import importlib.metadata
 import json
+import math
 import zipfile
+from dataclasses import dataclass
 from datetime import UTC, datetime
 from pathlib import Path
-from typing import Any
+from typing import TYPE_CHECKING, Any
+
+if TYPE_CHECKING:
+    from velocity.sweep import RunSpec, SweepResult
 
 RO_CRATE_CONTEXT = "https://w3id.org/ro/crate/1.1/context"
 RO_CRATE_PROFILE = "https://w3id.org/ro/crate/1.1"
@@ -250,3 +255,98 @@ def create_archive(
         zf.writestr("README.md", readme)
         zf.writestr("ro-crate-metadata.json", json.dumps(metadata, indent=2))
     return archive_path
+
+
+@dataclass
+class ArchiveContents:
+    """What `read_archive` recovers from a reproducibility crate."""
+
+    specs: list[RunSpec]
+    original: dict[str, Any] | None  # parsed comparison.json (a SweepResult dump), if present
+
+
+def read_archive(archive_path: Path) -> ArchiveContents:
+    """Recover the per-run `RunSpec`s (and original results) from a crate zip.
+
+    Reads the bundled `<run>/config.json` entries and `comparison.json` directly
+    from the archive — no extraction to disk needed to inspect or re-run it.
+    """
+    from velocity.sweep import RunSpec
+
+    archive_path = Path(archive_path)
+    specs: list[RunSpec] = []
+    original: dict[str, Any] | None = None
+    with zipfile.ZipFile(archive_path) as zf:
+        names = set(zf.namelist())
+        for name in sorted(n for n in names if n.endswith("/config.json")):
+            specs.append(RunSpec.model_validate_json(zf.read(name)))
+        if "comparison.json" in names:
+            original = json.loads(zf.read("comparison.json"))
+    if not specs:
+        raise ValueError(f"{archive_path} has no <run>/config.json — not a velocity archive")
+    return ArchiveContents(specs=specs, original=original)
+
+
+def reproduce_archive(archive_path: Path, *, out_dir: Path) -> SweepResult:
+    """Re-run an archived sweep from its bundled configs into ``out_dir``.
+
+    A reproduction in the ACM/NISO sense: same configs + code, re-executed. Reuses
+    the existing ``run_sweep`` runner (DRY) over the recovered specs.
+    """
+    from velocity.sweep import run_sweep
+
+    contents = read_archive(archive_path)
+    return run_sweep(contents.specs, out_dir=Path(out_dir))
+
+
+@dataclass
+class ResultDiff:
+    """One run's archived vs reproduced final loss, and whether they agree."""
+
+    name: str
+    original: float | None
+    reproduced: float
+    ok: bool
+
+
+def _loss_close(a: float | None, b: float, rel_tol: float) -> bool:
+    if a is None:
+        return False  # run absent from the original results
+    a_nan = math.isnan(a)
+    b_nan = math.isnan(b)
+    if a_nan and b_nan:
+        return True  # both undefined → not a mismatch (the offline stub's losses are nan)
+    if a_nan or b_nan:
+        return False
+    return math.isclose(a, b, rel_tol=rel_tol, abs_tol=0.0)
+
+
+def compare_results(
+    original: dict[str, Any], reproduced: SweepResult, *, rel_tol: float = 1e-6
+) -> list[ResultDiff]:
+    """Per-run final-loss agreement within a relative tolerance (nan-safe).
+
+    Tolerance-based, not bit-exact: ML / float aggregation is not bitwise
+    deterministic across runs and hardware, so asserting equality would emit false
+    failures. ``original`` is a parsed ``comparison.json`` (a ``SweepResult`` dump).
+    """
+    orig_runs = {r["spec"]["name"]: r for r in original.get("runs", [])}
+    diffs: list[ResultDiff] = []
+    for run in reproduced.runs:
+        name = run.spec.name
+        if name in orig_runs:
+            raw = orig_runs[name].get("final_loss")
+            # pydantic serializes an in-memory nan loss to JSON null, so a present
+            # run with null final_loss means nan, not "missing".
+            original_loss = float("nan") if raw is None else raw
+        else:
+            original_loss = None  # run absent from the archived results
+        diffs.append(
+            ResultDiff(
+                name=name,
+                original=original_loss,
+                reproduced=run.final_loss,
+                ok=_loss_close(original_loss, run.final_loss, rel_tol),
+            )
+        )
+    return diffs
diff --git a/python/velocity/cli.py b/python/velocity/cli.py
index 64b3279..5d392f1 100644
--- a/python/velocity/cli.py
+++ b/python/velocity/cli.py
@@ -466,3 +466,48 @@ def archive(
     except ValueError as e:
         raise typer.BadParameter(str(e)) from e
     typer.echo(f"Wrote {written}")
+
+
+@app.command()
+def reproduce(
+    archive: Path = typer.Argument(  # noqa: B008
+        ..., help="A reproducibility archive (.zip produced by `velocity archive`)."
+    ),
+    out: Path | None = typer.Option(  # noqa: B008
+        None, "--out", help="Output dir for the re-run (default: out/<ts>-reproduce)."
+    ),
+    check: bool = typer.Option(
+        False, "--check", help="Compare the re-run's results against the archived ones."
+    ),
+    tolerance: float = typer.Option(
+        1e-6, "--tolerance", help="Relative tolerance for --check (not bit-exact)."
+    ),
+) -> None:
+    """Re-run an archived sweep — a reproduction (same config + code, re-executed).
+
+    With ``--check``, compares each run's reproduced final loss against the
+    archived value within ``--tolerance`` (tolerance-based, since float
+    aggregation isn't bitwise reproducible) and exits non-zero on any mismatch.
+    """
+    from velocity.archive import compare_results, read_archive, reproduce_archive
+
+    if out is None:
+        ts = datetime.now(UTC).strftime("%Y%m%dT%H%M%SZ")
+        out = Path("out") / f"{ts}-reproduce"
+    try:
+        result = reproduce_archive(archive, out_dir=out)
+    except (ValueError, FileNotFoundError) as e:
+        raise typer.BadParameter(str(e)) from e
+    typer.echo(f"Reproduced {len(result.runs)} run(s) into {out}")
+
+    if check:
+        original = read_archive(archive).original
+        if original is None:
+            typer.echo("No archived results to check against (comparison.json absent).")
+            return
+        diffs = compare_results(original, result, rel_tol=tolerance)
+        for d in diffs:
+            mark = "ok" if d.ok else "MISMATCH"
+            typer.echo(f"  [{mark}] {d.name}: archived={d.original} reproduced={d.reproduced}")
+        if any(not d.ok for d in diffs):
+            raise typer.Exit(code=1)
diff --git a/tests/test_archive.py b/tests/test_archive.py
index edcfe91..be9411f 100644
--- a/tests/test_archive.py
+++ b/tests/test_archive.py
@@ -13,7 +13,10 @@
     RO_CRATE_CONTEXT,
     RO_CRATE_PROFILE,
     build_ro_crate_metadata,
+    compare_results,
     create_archive,
+    read_archive,
+    reproduce_archive,
 )
 from velocity.cli import app
 
@@ -150,3 +153,91 @@ def test_cli_archive_command_writes_zip(tmp_path):
     assert str(dest) in result.output
     with zipfile.ZipFile(dest) as zf:
         assert "ro-crate-metadata.json" in zf.namelist()
+
+
+def _real_sweep(tmp_path: Path, seed: int = 7) -> Path:
+    """A real (offline-stub) sweep dir with valid RunSpec config.json files."""
+    from velocity.strategy import FedAvg
+    from velocity.sweep import RunSpec, run_sweep
+
+    out = tmp_path / "orig-sweep"
+    run_sweep(
+        [RunSpec(name="fedavg-baseline", strategy=FedAvg(), rounds=1, min_clients=1, seed=seed)],
+        out_dir=out,
+    )
+    return out
+
+
+def test_read_archive_recovers_specs(tmp_path):
+    archive = create_archive(
+        _real_sweep(tmp_path, seed=7), archive_path=tmp_path / "a.zip", lockfile=tmp_path / "none"
+    )
+    contents = read_archive(archive)
+    assert [s.name for s in contents.specs] == ["fedavg-baseline"]
+    assert contents.specs[0].seed == 7
+    assert type(contents.specs[0].strategy).__name__ == "FedAvg"
+    assert contents.original is not None  # comparison.json recovered for --check
+
+
+def test_reproduce_archive_reruns(tmp_path):
+    archive = create_archive(
+        _real_sweep(tmp_path, seed=7), archive_path=tmp_path / "a.zip", lockfile=tmp_path / "none"
+    )
+    result = reproduce_archive(archive, out_dir=tmp_path / "repro")
+    assert {r.spec.name for r in result.runs} == {"fedavg-baseline"}
+    assert (tmp_path / "repro" / "comparison.json").exists()  # fresh sweep output written
+
+
+def test_compare_results_tolerance_and_nan():
+    from velocity.strategy import FedAvg
+    from velocity.sweep import RunResult, RunSpec, SweepResult
+
+    def mk(name: str, loss: float) -> RunResult:
+        return RunResult(
+            spec=RunSpec(name=name, strategy=FedAvg()),
+            rounds=[],
+            final_loss=loss,
+            mean_loss=loss,
+            elapsed_seconds=0.0,
+        )
+
+    reproduced = SweepResult(
+        runs=[mk("a", 0.5000001), mk("b", 9.9), mk("c", float("nan"))],
+        total_elapsed=1.0,
+        serial_elapsed=1.0,
+        parallel=1,
+        out_dir="x",
+    )
+    original = {
+        "runs": [
+            {"spec": {"name": "a"}, "final_loss": 0.5},
+            {"spec": {"name": "b"}, "final_loss": 1.0},
+            {"spec": {"name": "c"}, "final_loss": float("nan")},
+        ]
+    }
+    diffs = {d.name: d for d in compare_results(original, reproduced, rel_tol=1e-3)}
+    assert diffs["a"].ok  # 0.5000001 vs 0.5 within 1e-3
+    assert not diffs["b"].ok  # 9.9 vs 1.0 — clear mismatch
+    assert diffs["c"].ok  # both nan → undefined, not a mismatch (stub case)
+
+
+def test_cli_reproduce_command(tmp_path):
+    archive = create_archive(
+        _real_sweep(tmp_path, seed=3), archive_path=tmp_path / "a.zip", lockfile=tmp_path / "none"
+    )
+    out = tmp_path / "repro-out"
+    result = CliRunner().invoke(app, ["reproduce", str(archive), "--out", str(out)])
+    assert result.exit_code == 0, result.output
+    assert (out / "comparison.json").exists()
+
+
+def test_cli_reproduce_check_reports(tmp_path):
+    archive = create_archive(
+        _real_sweep(tmp_path, seed=3), archive_path=tmp_path / "a.zip", lockfile=tmp_path / "none"
+    )
+    result = CliRunner().invoke(
+        app, ["reproduce", str(archive), "--out", str(tmp_path / "r"), "--check"]
+    )
+    # Stub losses are nan (both-nan = ok), so --check passes and prints a report.
+    assert result.exit_code == 0, result.output
+    assert "fedavg-baseline" in result.output

From 45b9d0096505b2e03dc3ea4d7a4729842e7a1d44 Mon Sep 17 00:00:00 2001
From: AJ Barea <ajbareaa@gmail.com>
Date: Mon, 1 Jun 2026 07:21:17 -0400
Subject: [PATCH 2/2] docs(readme): list archive + reproduce in CLI reference +
 quickstart

---
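Reviewer note: the quickstart line for `velocity reproduce ... --check` lends
itself to scripting, since the command exits non-zero on a mismatch. A minimal
sketch of a wrapper follows; `reproduce_argv` and `verify` are hypothetical
helper names, and only flags documented in this series are used.

```python
from __future__ import annotations

import subprocess


def reproduce_argv(archive: str, tolerance: float | None = None) -> list[str]:
    """Build the command line for a checked reproduction of one archive."""
    argv = ["velocity", "reproduce", archive, "--check"]
    if tolerance is not None:
        argv += ["--tolerance", str(tolerance)]
    return argv


def verify(archive: str, tolerance: float | None = None) -> bool:
    """Run the reproduction and report whether the process exited with zero."""
    # check=False: a non-zero exit is an expected outcome here, not an error.
    completed = subprocess.run(reproduce_argv(archive, tolerance), check=False)
    # Caveat from patch 1/2: exit 0 is also returned when the archive has no
    # comparison.json, in which case nothing was actually compared.
    return completed.returncode == 0
```

A caller would use it as `verify("run.crate.zip", tolerance=1e-4)`; this
assumes `velocity` is on `PATH`.
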
 README.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/README.md b/README.md
index 703f2b3..beac35c 100644
--- a/README.md
+++ b/README.md
@@ -124,6 +124,8 @@ velocity run --model-id test/model --dataset test/dataset --rounds 1 --min-clien
 velocity simulate-attack model_poisoning --intensity 0.2
 velocity sweep --strategies FedAvg,Krum --attacks gaussian_noise --rounds 5
 velocity leaderboard --metric robustness
+velocity archive out/<ts>-sweep -o run.crate.zip   # bundle a sweep into a reproducibility archive
+velocity reproduce run.crate.zip --check            # re-run it elsewhere and verify results match
 ```
 
 ---
@@ -136,6 +138,8 @@ velocity leaderboard --metric robustness
 - `velocity simulate-attack ...` — register one attack and run a round
 - `velocity sweep ...` — run a strategy × attack matrix across seeds (see [`docs/sweep-spec.md`](docs/sweep-spec.md))
 - `velocity leaderboard ...` — rank stored runs (accuracy / rounds-to-target / wall-clock / comm-cost / pareto / pareto-slices / robustness)
+- `velocity archive ...` — package a sweep output into a single-file reproducibility archive (RO-Crate)
+- `velocity reproduce ...` — re-run an archived sweep (`--check` verifies results within tolerance)
 
 Full reference: [`docs/cli.md`](docs/cli.md)