diff --git a/IMPL.md b/IMPL.md index b412ade..14da235 100644 --- a/IMPL.md +++ b/IMPL.md @@ -1,32 +1,25 @@ -# IMPL: reproducibility loop complete — nothing in flight +# IMPL: strategy-param fingerprint fidelity shipped — `hyperparameter_sage` next -Session-by-session checklist for what's actively in flight. Long-horizon -planning lives in [ROADMAP.md](ROADMAP.md). +Session-by-session checklist. Long-horizon in [ROADMAP.md](ROADMAP.md). ## In flight -_Nothing open._ The reproducibility loop is complete (both shipped 2026-06-01, -ROADMAP → Completed): - -- **`velocity archive `** — package a sweep output into a single-file - RO-Crate (Process Run Crate, `.zip`): the sweep artifacts plus a `uv.lock` - snapshot, a how-to-reproduce README, and a spec-conformant - `ro-crate-metadata.json`. Zero new dependency (hand-rolled JSON-LD). -- **`velocity reproduce [--check]`** — the inverse: recover the - per-run `RunSpec`s from the crate, re-run via `run_sweep`, and (with `--check`) - verify each run's final loss against the archived value within a relative - tolerance (not bit-exact; nan-safe). Exits non-zero on a real mismatch. - -Both are CLI-only (no MCP-contract churn), built on the existing `sweep` -machinery (DRY), and live in `velocity.archive`. - -## Next up (queued, not active) - -See ROADMAP. The clean CLI-only runway around the sweep/archive area is tapped; -what remains needs design or research judgment rather than execution — the -leaderboard read-side items (cross-config normalisation, the LLM-backed A2A -auditors: convergence / robustness / hyperparameter) and the research-tier -streams (server-side DP, streaming aggregation, compression). +_Nothing open._ **Strategy hyperparameters now participate in the run +fingerprint** (shipped 2026-06-01, ROADMAP → Completed). `strategy.strategy_params()` ++ the extracted `mcp_app._real_run_config` record a run's hyperparameters under a +`strategy_params` config key, so Krum f=2 and f=3 are no longer collapsed into one +leaderboard row. Fixed a latent per-fingerprint-board correctness gap (the +producers stored only the strategy name) and unblocked `hyperparameter_sage`. + +## Next pickup + +- **`hyperparameter_sage`** (ROADMAP → A2A specialists) — now data-unblocked. An + MCP tool that, given a target config, ranks the top-k hyperparameter values + observed in *matched* runs (mean±std accuracy, sample count), hard-failing below + a sample threshold (start: 10) per the Sage guard-rails. The guard-rails are + statistically grounded (variance in ML benchmarks, arXiv:2103.03098). Remaining + is a design pass on the precise "matched runs" semantics + the threshold — worth + AJ's eye, since the recommendation semantics are a research-judgment call. When picking one up, replace this file with a full session plan (Why / Decisions / Scope / Out of scope / Definition of done). diff --git a/ROADMAP.md b/ROADMAP.md index 2fc3e14..fe0bbda 100644 --- a/ROADMAP.md +++ b/ROADMAP.md @@ -319,7 +319,10 @@ rather than the curated, dumped arena CSV the first cut renders. **shipped 2026-06-01**, see Completed), `hyperparameter_sage` (given a target config, returns the top-k α / μ / f values observed in matched runs, with sample count + variance, and flags when sample size is too low to - recommend). + recommend). `hyperparameter_sage`'s **data prerequisite is now met** — strategy + hyperparameters participate in the config fingerprint as of 2026-06-01 (see + Completed), so runs differing only in α / μ / f are distinguishable; what + remains is the design pass on what "matched" means + the sample thresholds. - **Sage guard-rails** — any sage answer must quote sample size and variance. "α=0.3 was top-3 over 47 runs on MNIST+shard+no-attack, IQR ±0.008 final accuracy" is useful; "use α=0.3" is cargo cult. @@ -441,6 +444,20 @@ a dash is illegal. Only display/brand prose is "Velocity-FL". Authoritative records: git history, `docs/benchmarks.md`, `docs/convergence.md`, `docs/strategies.md`. This index is pruned once work is durably shipped. +- 2026-06-01 — **Strategy hyperparameters now in the run fingerprint (leaderboard + fidelity; unblocks `hyperparameter_sage`).** Both DB producers stored only the + strategy *name* (`strategy_name(strat)`), so a strategy's hyperparameters (Krum + `f`, FedProx `mu`, TrimmedMean `k`, …) never reached the config — and + `config_fingerprint`, despite its docstring, couldn't distinguish them. Krum f=2 + and f=3 shared a fingerprint, so every per-fingerprint board silently averaged + *different* experiments into one row. Added `strategy.strategy_params(strat)` + (dataclass fields → dict; `{}` for paramless strategies, so their fingerprints are + unchanged) and extracted `mcp_app._real_run_config`, which records both the name + (for the `runs.strategy` column) and `strategy_params` so the existing + canonical-JSON fingerprint resolves them. Only `run_real_training` carries + parameterised strategies (`run_demo`'s string input is paramless-only — no bug, no + change). No MCP / CLI surface change. TDD (helper + distinct-fingerprint guard); + full suite + lint green. Unblocks the roadmapped `hyperparameter_sage`. - 2026-06-01 — **`velocity reproduce` — re-run an archived sweep (closes the loop).** The inverse of `velocity archive`: `read_archive` recovers the per-run `RunSpec`s (and the original `comparison.json`) straight from the crate zip; `reproduce_archive` diff --git a/python/velocity/db.py b/python/velocity/db.py index 2d241df..6abd0a6 100644 --- a/python/velocity/db.py +++ b/python/velocity/db.py @@ -224,8 +224,10 @@ def config_fingerprint(config: dict[str, Any]) -> str: Two runs that differ only in ``seed`` (or per-commit ``git_sha``) share a fingerprint, so the leaderboard can group repeats and report mean±std the way ``scripts/dump_attack_arena.py`` already does across seeds. Everything - else — dataset, partition, strategy, attack, their params, ``vfl_version`` - — is identity and participates in the hash. + else — dataset, partition, strategy (with its hyperparameters, recorded + under the ``strategy_params`` key by the producer so Krum f=2 and f=3 don't + collide), attack and its params, ``vfl_version`` — is identity and + participates in the hash. research(2026-05): RFC 8785 (JSON Canonicalization Scheme) → SHA-256 is the cross-language standard for content-addressing JSON. This fingerprint is diff --git a/python/velocity/mcp_app.py b/python/velocity/mcp_app.py index 424a3f5..6a0b4da 100644 --- a/python/velocity/mcp_app.py +++ b/python/velocity/mcp_app.py @@ -983,6 +983,23 @@ async def run_real_training( ) +def _real_run_config(strategy: Any, **fields: Any) -> dict[str, Any]: + """Run config for a real-training run. + + Records the strategy *name* (for the ``runs.strategy`` column + display) and + its hyperparameters under ``strategy_params``, so ``config_fingerprint`` + resolves runs that differ only in a hyperparameter (Krum f=2 vs f=3) rather + than conflating them. ``fields`` are the remaining config entries. + """ + from velocity.strategy import strategy_name, strategy_params + + return { + "strategy": strategy_name(strategy), + "strategy_params": strategy_params(strategy), + **fields, + } + + async def _execute_real_training( *, user_id: str, @@ -1103,7 +1120,7 @@ def _run_real_training_sync( from velocity import _core from velocity.datasets import load_federated from velocity.paper_attacks import craft_byzantine_updates - from velocity.strategy import FedProx, strategy_name + from velocity.strategy import FedProx from velocity.training import ( evaluate, layer_shapes, @@ -1153,21 +1170,21 @@ def make_model() -> nn.Module: # strategy keeps the FedAvg-style 0.0 proximal term. proximal_mu = strategy.mu if isinstance(strategy, FedProx) else 0.0 - config = { - "strategy": strategy_name(strategy), - "model_id": model_id, - "dataset": dataset, - "rounds": rounds, - "num_clients": num_clients, - "local_epochs": local_epochs, - "batch_size": batch_size, - "lr": lr, - "seed": seed, - "partition": partition, - "partition_kwargs": partition_kwargs, - "attack": attack, - "mode": "real", - } + config = _real_run_config( + strategy, + model_id=model_id, + dataset=dataset, + rounds=rounds, + num_clients=num_clients, + local_epochs=local_epochs, + batch_size=batch_size, + lr=lr, + seed=seed, + partition=partition, + partition_kwargs=partition_kwargs, + attack=attack, + mode="real", + ) # num_malicious is part of the attack spec; absent on clean runs so their # config fingerprint is unchanged. robustness_delta_leaderboard strips it # (with `attack`) to pair an attacked run with its no-attack baseline. diff --git a/python/velocity/strategy.py b/python/velocity/strategy.py index 70107bc..9310f32 100644 --- a/python/velocity/strategy.py +++ b/python/velocity/strategy.py @@ -208,6 +208,18 @@ def strategy_name(strategy: Strategy) -> str: return type(strategy).__name__ +def strategy_params(strategy: Strategy) -> dict[str, Any]: + """Hyperparameters of a strategy instance as a dict — ``{"f": 2}`` for + ``Krum(f=2)``, ``{}`` for paramless strategies (FedAvg, FedMedian, ArKrum). + + Recorded under ``config["strategy_params"]`` so a strategy's hyperparameters + participate in :func:`velocity.db.config_fingerprint`. Without it, runs that + differ only in a hyperparameter (Krum f=2 vs f=3) share a fingerprint and the + leaderboard conflates them. + """ + return {f.name: getattr(strategy, f.name) for f in fields(strategy)} + + def parse_strategy(value: str | dict[str, Any] | Strategy) -> Strategy: """Coerce a user-supplied value into a strategy instance. diff --git a/tests/test_mcp_real_training.py b/tests/test_mcp_real_training.py index a2b7c97..1c5d6ee 100644 --- a/tests/test_mcp_real_training.py +++ b/tests/test_mcp_real_training.py @@ -393,3 +393,21 @@ def mk(state: dict) -> Any: ) assert isinstance(updates[0], _core.ClientUpdate), attack assert updates[0].num_samples == 10, attack + + +def test_real_run_config_records_strategy_params() -> None: + """A run config records strategy hyperparameters so the fingerprint resolves them. + + Before this, only the strategy *name* was stored, so Krum f=2 and f=3 + collapsed to one config_fingerprint and the leaderboard conflated them. + """ + from velocity.strategy import FedAvg, Krum + + c2 = mcp_app._real_run_config(Krum(f=2), dataset="mnist", rounds=5) + c3 = mcp_app._real_run_config(Krum(f=3), dataset="mnist", rounds=5) + assert c2["strategy"] == "Krum" # name preserved for the runs.strategy column + assert c2["strategy_params"] == {"f": 2} + # Distinct hyperparameters → distinct fingerprints (no longer conflated). + assert db.config_fingerprint(c2) != db.config_fingerprint(c3) + # Paramless strategies carry an empty params dict (fingerprint unchanged). + assert mcp_app._real_run_config(FedAvg(), dataset="mnist")["strategy_params"] == {} diff --git a/tests/test_strategy.py b/tests/test_strategy.py index e96b9f0..91540c1 100644 --- a/tests/test_strategy.py +++ b/tests/test_strategy.py @@ -19,6 +19,7 @@ complexity_for, parse_strategy, strategy_name, + strategy_params, ) @@ -188,3 +189,15 @@ def test_complexity_for_accepts_name_and_instance(): def test_complexity_for_unknown_name_raises(): with pytest.raises(KeyError): complexity_for("FedNope") + + +def test_strategy_params_extracts_dataclass_fields(): + # Paramless strategies → empty dict (so their config fingerprint is unchanged). + assert strategy_params(FedAvg()) == {} + assert strategy_params(FedMedian()) == {} + assert strategy_params(ArKrum()) == {} + # Parameterised strategies → their hyperparameters, so the fingerprint resolves them. + assert strategy_params(Krum(f=2)) == {"f": 2} + assert strategy_params(FedProx(mu=0.1)) == {"mu": 0.1} + assert strategy_params(TrimmedMean(k=1)) == {"k": 1} + assert strategy_params(MultiKrum(f=1, m=3)) == {"f": 1, "m": 3}