Merged
48 changes: 23 additions & 25 deletions IMPL.md
@@ -1,34 +1,32 @@
# IMPL: reproducibility archive shipped, `velocity reproduce` next
# IMPL: reproducibility loop complete, nothing in flight

Session-by-session checklist for what's actively in flight. Long-horizon
planning lives in [ROADMAP.md](ROADMAP.md).

## In flight

_Nothing open._ The **`velocity archive`** reproducibility-archive generator
shipped 2026-06-01 (ROADMAP → Completed): it packages a `velocity sweep` output
directory into a single-file **RO-Crate** (Process Run Crate profile, `.zip`) —
the sweep artifacts plus a `uv.lock` snapshot (fallback `installed-packages.txt`),
a how-to-reproduce `README.md`, and a hand-rolled, spec-conformant
`ro-crate-metadata.json` — with **zero new dependency**. New importable
`velocity.archive` module; `velocity archive <out-dir> [-o] [--lockfile]` CLI,
reusing the sweep-time `capture_manifest()` provenance (DRY).

The web-search at the format decision point paid off: the ROADMAP had guessed a
hand-rolled `.tar.gz`, but RO-Crate (Process Run Crate) is the 2026 standard for
packaging a run with machine-readable provenance, and the spec lets us emit the
JSON-LD by hand (no `rocrate` dep). Also corrected the ROADMAP's inaccurate
`velocity run --save-reproducible-archive` host — `velocity run` is
stateless/seedless, so the archive operates on `velocity sweep` output instead.

## Next pickup

- **`velocity reproduce <archive.zip>`** (ROADMAP → Audit-of-audit follow-ups) —
the inverse of `archive`: unpack a crate, recover the per-run `RunSpec`s from
the bundled `config.json`s, re-run via `run_sweep`, diff against the bundled
results. The archive's README already documents the manual round-trip (this
session verified it executes); this automates it. Offline-testable via the
demo/stub server — no GPU/HF.
_Nothing open._ The reproducibility loop is complete (both shipped 2026-06-01,
ROADMAP → Completed):

- **`velocity archive <out-dir>`** — package a sweep output into a single-file
RO-Crate (Process Run Crate, `.zip`): the sweep artifacts plus a `uv.lock`
snapshot, a how-to-reproduce README, and a spec-conformant
`ro-crate-metadata.json`. Zero new dependency (hand-rolled JSON-LD).
- **`velocity reproduce <archive.zip> [--check]`** — the inverse: recover the
per-run `RunSpec`s from the crate, re-run via `run_sweep`, and (with `--check`)
verify each run's final loss against the archived value within a relative
tolerance (not bit-exact; nan-safe). Exits non-zero on a real mismatch.

Both are CLI-only (no MCP-contract churn), built on the existing `sweep`
machinery (DRY), and live in `velocity.archive`.

## Next up (queued, not active)

See ROADMAP. The clean CLI-only runway around the sweep/archive area is tapped;
what remains needs design or research judgment rather than execution — the
leaderboard read-side items (cross-config normalisation, the LLM-backed A2A
auditors: convergence / robustness / hyperparameter) and the research-tier
streams (server-side DP, streaming aggregation, compression).

When picking one up, replace this file with a full session plan
(Why / Decisions / Scope / Out of scope / Definition of done).
4 changes: 4 additions & 0 deletions README.md
@@ -124,6 +124,8 @@ velocity run --model-id test/model --dataset test/dataset --rounds 1 --min-clien
velocity simulate-attack model_poisoning --intensity 0.2
velocity sweep --strategies FedAvg,Krum --attacks gaussian_noise --rounds 5
velocity leaderboard --metric robustness
velocity archive out/<ts>-sweep -o run.crate.zip # bundle a sweep into a reproducibility archive
velocity reproduce run.crate.zip --check # re-run it elsewhere and verify results match
```

---
@@ -136,6 +138,8 @@ velocity leaderboard --metric robustness
- `velocity simulate-attack ...` — register one attack and run a round
- `velocity sweep ...` — run a strategy × attack matrix across seeds (see [`docs/sweep-spec.md`](docs/sweep-spec.md))
- `velocity leaderboard ...` — rank stored runs (accuracy / rounds-to-target / wall-clock / comm-cost / pareto / pareto-slices / robustness)
- `velocity archive ...` — package a sweep output into a single-file reproducibility archive (RO-Crate)
- `velocity reproduce ...` — re-run an archived sweep (`--check` verifies results within tolerance)

Full reference: [`docs/cli.md`](docs/cli.md)

20 changes: 14 additions & 6 deletions ROADMAP.md
@@ -407,12 +407,6 @@ rather than the curated, dumped arena CSV the first cut renders.

> Source: 2026-05-21 audit-of-audits review (deleted after extraction). Items that survived verdict review but don't fit Compression / Privacy / Streaming cleanly.

- **`velocity reproduce <archive.zip>`** — the inverse of the now-shipped
`velocity archive` (see Completed): unpack a reproducibility crate, recover the
per-run `RunSpec`s from the bundled `config.json`s, re-run via `run_sweep`, and
diff against the bundled results. The archive's README already documents the
manual round-trip (verified working end-to-end); this automates it. Cheaply
testable — the demo/stub server runs offline, no GPU/HF. Tier 1.
- **Cross-silo Pareto benchmark suite** — power-law (Pareto: 20% clients hold 80% data) realistic distribution as a benchmark axis alongside the existing IID + Dirichlet partitioners. Measure convergence + per-client accuracy variance + robustness-under-attack on the same skew. Real FL deployments are cross-silo, not equal-sized; benchmarks should reflect that. Fits under `## Performance`. Tier 3 research.

## Cross-sister polish (2026-05-21)
@@ -447,6 +441,20 @@ a dash is illegal. Only display/brand prose is "Velocity-FL".

Authoritative records: git history, `docs/benchmarks.md`, `docs/convergence.md`, `docs/strategies.md`. This index is pruned once work is durably shipped.

- 2026-06-01 — **`velocity reproduce` — re-run an archived sweep (closes the loop).**
The inverse of `velocity archive`: `read_archive` recovers the per-run `RunSpec`s
(and the original `comparison.json`) straight from the crate zip; `reproduce_archive`
re-runs them through the existing `run_sweep` (DRY). `velocity reproduce <archive.zip>
[--out] [--check] [--tolerance]`; `--check` compares each run's reproduced final loss
against the archived value within a relative tolerance and exits non-zero on a real
mismatch. Tolerance-based, **not** bit-exact — ML / float aggregation isn't bitwise
reproducible across runs/hardware — and nan-safe (pydantic serializes an in-memory nan
loss to JSON null, so a present run with null loss is read as nan and doesn't
false-mismatch a reproduced nan; an absent run is a real mismatch). TDD (spec recovery
+ end-to-end re-run on the offline stub + tolerance/nan logic + CLI); full suite + lint
green; documented in docs/cli.md (roster guard). research(2026-06): ACM/NISO "reproduced"
= same config + code re-executed (acm.org/publications/badging-terms); numerical
non-determinism in ML → tolerance not equality (arXiv:2506.09501, arXiv:2302.12691).
- 2026-06-01 — **Reproducibility archive generator (`velocity archive`).** Packages a
sweep output directory into a single-file **RO-Crate** (Process Run Crate profile,
`.zip`): the existing sweep artifacts (`manifest.json`, `comparison.{json,md}`,
23 changes: 23 additions & 0 deletions docs/cli.md
@@ -173,6 +173,29 @@ uv run velocity archive out/20260601T100000Z-sweep -o paper-artifact.zip

---

## `velocity reproduce`

Re-runs an archived sweep — a *reproduction* (same config and code, re-executed) of a bundle produced by [`velocity archive`](#velocity-archive). Optionally verifies the re-run matches the archived results.

```bash
# Re-run an archive into out/<ts>-reproduce
uv run velocity reproduce paper-artifact.zip

# Re-run and verify results match the archived ones, within a relative tolerance
uv run velocity reproduce paper-artifact.zip --check --tolerance 1e-4
```
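
To inspect an archive before re-running it, the per-run configs can be read straight from the zip without extracting it. The following is a minimal sketch using only the standard library, not the shipped `read_archive`: `list_run_configs` is a hypothetical helper, and the run names and config fields in the stand-in archive are invented for illustration (the real `RunSpec` schema is not reproduced here).

```python
import io
import json
import zipfile


def list_run_configs(archive_bytes: bytes) -> dict[str, dict]:
    """Hypothetical helper: map each <run>/config.json member to its parsed JSON."""
    configs = {}
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        # Sort member names so runs come back in a stable order.
        for name in sorted(zf.namelist()):
            if name.endswith("/config.json"):
                configs[name] = json.loads(zf.read(name))
    return configs


# Build a tiny stand-in archive in memory (made-up run names and fields).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("fedavg-seed0/config.json", json.dumps({"name": "fedavg-seed0"}))
    zf.writestr("krum-seed0/config.json", json.dumps({"name": "krum-seed0"}))
    zf.writestr("comparison.json", json.dumps({"runs": []}))

configs = list_run_configs(buf.getvalue())
for member, config in configs.items():
    print(member, config["name"])
```

Matching on the `/config.json` suffix skips top-level files such as `comparison.json`, so only per-run directories contribute a config.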

| Argument / Option | Type | Default | Description |
|---|---|---|---|
| `ARCHIVE` (positional) | `path` | none | A reproducibility archive `.zip` from `velocity archive`. |
| `--out` | `path` | `out/<timestamp>-reproduce` | Output directory for the re-run. |
| `--check` | flag | off | Compare reproduced per-run final loss against the archived values. |
| `--tolerance` | `float` | `1e-6` | Relative tolerance for `--check`. Not bit-exact — float aggregation isn't bitwise reproducible across runs/hardware. |

**Output** — a fresh sweep output directory. With `--check`, a per-run `ok` / `MISMATCH` report that exits non-zero if any run's result diverges beyond tolerance.
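
The comparison behind `--check` can be sketched in a few lines of Python. This is an illustration of a relative-tolerance, nan-safe check under stated assumptions, not the shipped code: `loss_close` is a hypothetical name, and it assumes a run missing from the archived results is passed in as `None`.

```python
import math
from typing import Optional


def loss_close(archived: Optional[float], reproduced: float, rel_tol: float = 1e-6) -> bool:
    """Hypothetical helper: do two final losses agree within a relative tolerance?"""
    if archived is None:
        return False  # nothing archived for this run, so it cannot match
    a_nan = math.isnan(archived)
    b_nan = math.isnan(reproduced)
    if a_nan or b_nan:
        # Two undefined losses agree; a single undefined loss is a mismatch.
        return a_nan and b_nan
    # Relative-only check: with abs_tol=0.0 a loss of exactly 0 matches only 0.
    return math.isclose(archived, reproduced, rel_tol=rel_tol, abs_tol=0.0)


print(loss_close(0.5, 0.5 + 1e-9))             # tiny drift, inside the default tolerance
print(loss_close(0.5, 0.51))                   # real divergence
print(loss_close(1.0, 1.00005, rel_tol=1e-4))  # passes only with a looser tolerance
print(loss_close(float("nan"), float("nan")))  # both undefined
```

Because the tolerance is relative, the allowed drift scales with the magnitude of the loss; a looser `--tolerance` may be appropriate when comparing runs across different hardware.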

---

## Exit codes

| Code | Meaning |
102 changes: 101 additions & 1 deletion python/velocity/archive.py
@@ -22,10 +22,15 @@

import importlib.metadata
import json
import math
import zipfile
from dataclasses import dataclass
from datetime import UTC, datetime
from pathlib import Path
from typing import Any
from typing import TYPE_CHECKING, Any

if TYPE_CHECKING:
    from velocity.sweep import RunSpec, SweepResult

RO_CRATE_CONTEXT = "https://w3id.org/ro/crate/1.1/context"
RO_CRATE_PROFILE = "https://w3id.org/ro/crate/1.1"
@@ -250,3 +255,98 @@ def create_archive(
        zf.writestr("README.md", readme)
        zf.writestr("ro-crate-metadata.json", json.dumps(metadata, indent=2))
    return archive_path


@dataclass
class ArchiveContents:
    """What `read_archive` recovers from a reproducibility crate."""

    specs: list[RunSpec]
    original: dict[str, Any] | None  # parsed comparison.json (a SweepResult dump), if present


def read_archive(archive_path: Path) -> ArchiveContents:
    """Recover the per-run `RunSpec`s (and original results) from a crate zip.

    Reads the bundled `<run>/config.json` entries and `comparison.json` directly
    from the archive — no extraction to disk needed to inspect or re-run it.
    """
    from velocity.sweep import RunSpec

    archive_path = Path(archive_path)
    specs: list[RunSpec] = []
    original: dict[str, Any] | None = None
    with zipfile.ZipFile(archive_path) as zf:
        names = set(zf.namelist())
        for name in sorted(n for n in names if n.endswith("/config.json")):
            specs.append(RunSpec.model_validate_json(zf.read(name)))
        if "comparison.json" in names:
            original = json.loads(zf.read("comparison.json"))
    if not specs:
        raise ValueError(f"{archive_path} has no <run>/config.json — not a velocity archive")
    return ArchiveContents(specs=specs, original=original)


def reproduce_archive(archive_path: Path, *, out_dir: Path) -> SweepResult:
    """Re-run an archived sweep from its bundled configs into ``out_dir``.

    A reproduction in the ACM/NISO sense: same configs + code, re-executed. Reuses
    the existing ``run_sweep`` runner (DRY) over the recovered specs.
    """
    from velocity.sweep import run_sweep

    contents = read_archive(archive_path)
    return run_sweep(contents.specs, out_dir=Path(out_dir))


@dataclass
class ResultDiff:
    """One run's archived vs reproduced final loss, and whether they agree."""

    name: str
    original: float | None
    reproduced: float
    ok: bool


def _loss_close(a: float | None, b: float, rel_tol: float) -> bool:
    if a is None:
        return False  # run absent from the original results
    a_nan = isinstance(a, float) and a != a
    b_nan = isinstance(b, float) and b != b
    if a_nan and b_nan:
        return True  # both undefined → not a mismatch (the offline stub's losses are nan)
    if a_nan or b_nan:
        return False
    return math.isclose(a, b, rel_tol=rel_tol, abs_tol=0.0)


def compare_results(
    original: dict[str, Any], reproduced: SweepResult, *, rel_tol: float = 1e-6
) -> list[ResultDiff]:
    """Per-run final-loss agreement within a relative tolerance (nan-safe).

    Tolerance-based, not bit-exact: ML / float aggregation is not bitwise
    deterministic across runs and hardware, so asserting equality would emit false
    failures. ``original`` is a parsed ``comparison.json`` (a ``SweepResult`` dump).
    """
    orig_runs = {r["spec"]["name"]: r for r in original.get("runs", [])}
    diffs: list[ResultDiff] = []
    for run in reproduced.runs:
        name = run.spec.name
        if name in orig_runs:
            raw = orig_runs[name].get("final_loss")
            # pydantic serializes an in-memory nan loss to JSON null, so a present
            # run with null final_loss means nan, not "missing".
            original_loss = float("nan") if raw is None else raw
        else:
            original_loss = None  # run absent from the archived results
        diffs.append(
            ResultDiff(
                name=name,
                original=original_loss,
                reproduced=run.final_loss,
                ok=_loss_close(original_loss, run.final_loss, rel_tol),
            )
        )
    return diffs
45 changes: 45 additions & 0 deletions python/velocity/cli.py
@@ -466,3 +466,48 @@ def archive(
    except ValueError as e:
        raise typer.BadParameter(str(e)) from e
    typer.echo(f"Wrote {written}")


@app.command()
def reproduce(
    archive: Path = typer.Argument(  # noqa: B008
        ..., help="A reproducibility archive (.zip produced by `velocity archive`)."
    ),
    out: Path | None = typer.Option(  # noqa: B008
        None, "--out", help="Output dir for the re-run (default: out/<ts>-reproduce)."
    ),
    check: bool = typer.Option(
        False, "--check", help="Compare the re-run's results against the archived ones."
    ),
    tolerance: float = typer.Option(
        1e-6, "--tolerance", help="Relative tolerance for --check (not bit-exact)."
    ),
) -> None:
    """Re-run an archived sweep — a reproduction (same config + code, re-executed).

    With ``--check``, compares each run's reproduced final loss against the
    archived value within ``--tolerance`` (tolerance-based, since float
    aggregation isn't bitwise reproducible) and exits non-zero on any mismatch.
    """
    from velocity.archive import compare_results, read_archive, reproduce_archive

    if out is None:
        ts = datetime.now(UTC).strftime("%Y%m%dT%H%M%SZ")
        out = Path("out") / f"{ts}-reproduce"
    try:
        result = reproduce_archive(archive, out_dir=out)
    except (ValueError, FileNotFoundError) as e:
        raise typer.BadParameter(str(e)) from e
    typer.echo(f"Reproduced {len(result.runs)} run(s) into {out}")

    if check:
        original = read_archive(archive).original
        if original is None:
            typer.echo("No archived results to check against (comparison.json absent).")
            return
        diffs = compare_results(original, result, rel_tol=tolerance)
        for d in diffs:
            mark = "ok" if d.ok else "MISMATCH"
            typer.echo(f" [{mark}] {d.name}: archived={d.original} reproduced={d.reproduced}")
        if any(not d.ok for d in diffs):
            raise typer.Exit(code=1)