ajbarea · ajbarea · Jun 1, 2026 · Jun 1, 2026
diff --git a/IMPL.md b/IMPL.md
@@ -1,32 +1,25 @@
-# IMPL: reproducibility loop complete — nothing in flight
+# IMPL: strategy-param fingerprint fidelity shipped — `hyperparameter_sage` next
 
-Session-by-session checklist for what's actively in flight. Long-horizon
-planning lives in [ROADMAP.md](ROADMAP.md).
+Session-by-session checklist. Long-horizon in [ROADMAP.md](ROADMAP.md).
 
 ## In flight
 
-_Nothing open._ The reproducibility loop is complete (both shipped 2026-06-01,
-ROADMAP → Completed):
-
-- **`velocity archive <out-dir>`** — package a sweep output into a single-file
-  RO-Crate (Process Run Crate, `.zip`): the sweep artifacts plus a `uv.lock`
-  snapshot, a how-to-reproduce README, and a spec-conformant
-  `ro-crate-metadata.json`. Zero new dependency (hand-rolled JSON-LD).
-- **`velocity reproduce <archive.zip> [--check]`** — the inverse: recover the
-  per-run `RunSpec`s from the crate, re-run via `run_sweep`, and (with `--check`)
-  verify each run's final loss against the archived value within a relative
-  tolerance (not bit-exact; nan-safe). Exits non-zero on a real mismatch.
-
-Both are CLI-only (no MCP-contract churn), built on the existing `sweep`
-machinery (DRY), and live in `velocity.archive`.
-
-## Next up (queued, not active)
-
-See ROADMAP. The clean CLI-only runway around the sweep/archive area is tapped;
-what remains needs design or research judgment rather than execution — the
-leaderboard read-side items (cross-config normalisation, the LLM-backed A2A
-auditors: convergence / robustness / hyperparameter) and the research-tier
-streams (server-side DP, streaming aggregation, compression).
+_Nothing open._ **Strategy hyperparameters now participate in the run
+fingerprint** (shipped 2026-06-01, ROADMAP → Completed). `strategy.strategy_params()`
++ the extracted `mcp_app._real_run_config` record a run's hyperparameters under a
+`strategy_params` config key, so Krum f=2 and f=3 are no longer collapsed into one
+leaderboard row. Fixed a latent per-fingerprint-board correctness gap (the
+producers stored only the strategy name) and unblocked `hyperparameter_sage`.
+
+## Next pickup
+
+- **`hyperparameter_sage`** (ROADMAP → A2A specialists) — now data-unblocked. An
+  MCP tool that, given a target config, ranks the top-k hyperparameter values
+  observed in *matched* runs (mean±std accuracy, sample count), hard-failing below
+  a sample threshold (start: 10) per the Sage guard-rails. The guard-rails are
+  statistically grounded (variance in ML benchmarks, arXiv:2103.03098). Remaining
+  is a design pass on the precise "matched runs" semantics + the threshold — worth
+  AJ's eye, since the recommendation semantics are a research-judgment call.
 
 When picking one up, replace this file with a full session plan
 (Why / Decisions / Scope / Out of scope / Definition of done).
diff --git a/ROADMAP.md b/ROADMAP.md
@@ -319,7 +319,10 @@ rather than the curated, dumped arena CSV the first cut renders.
   **shipped 2026-06-01**, see Completed), `hyperparameter_sage` (given a target config, returns the
   top-k α / μ / f values observed in matched runs, with sample
   count + variance, and flags when sample size is too low to
-  recommend).
+  recommend). `hyperparameter_sage`'s **data prerequisite is now met** — strategy
+  hyperparameters participate in the config fingerprint as of 2026-06-01 (see
+  Completed), so runs differing only in α / μ / f are distinguishable; what
+  remains is the design pass on what "matched" means + the sample thresholds.
 - **Sage guard-rails** — any sage answer must quote sample size and
   variance. "α=0.3 was top-3 over 47 runs on MNIST+shard+no-attack,
   IQR ±0.008 final accuracy" is useful; "use α=0.3" is cargo cult.
@@ -441,6 +444,20 @@ a dash is illegal. Only display/brand prose is "Velocity-FL".
 
 Authoritative records: git history, `docs/benchmarks.md`, `docs/convergence.md`, `docs/strategies.md`. This index is pruned once work is durably shipped.
 
+- 2026-06-01 — **Strategy hyperparameters now in the run fingerprint (leaderboard
+  fidelity; unblocks `hyperparameter_sage`).** Both DB producers stored only the
+  strategy *name* (`strategy_name(strat)`), so a strategy's hyperparameters (Krum
+  `f`, FedProx `mu`, TrimmedMean `k`, …) never reached the config — and
+  `config_fingerprint`, despite its docstring, couldn't distinguish them. Krum f=2
+  and f=3 shared a fingerprint, so every per-fingerprint board silently averaged
+  *different* experiments into one row. Added `strategy.strategy_params(strat)`
+  (dataclass fields → dict; `{}` for paramless strategies, so their fingerprints are
+  unchanged) and extracted `mcp_app._real_run_config`, which records both the name
+  (for the `runs.strategy` column) and `strategy_params` so the existing
+  canonical-JSON fingerprint resolves them. Only `run_real_training` carries
+  parameterised strategies (`run_demo`'s string input is paramless-only — no bug, no
+  change). No MCP / CLI surface change. TDD (helper + distinct-fingerprint guard);
+  full suite + lint green. Unblocks the roadmapped `hyperparameter_sage`.
 - 2026-06-01 — **`velocity reproduce` — re-run an archived sweep (closes the loop).**
   The inverse of `velocity archive`: `read_archive` recovers the per-run `RunSpec`s
   (and the original `comparison.json`) straight from the crate zip; `reproduce_archive`

diff --git a/python/velocity/db.py b/python/velocity/db.py
@@ -224,8 +224,10 @@ def config_fingerprint(config: dict[str, Any]) -> str:
     Two runs that differ only in ``seed`` (or per-commit ``git_sha``) share a
     fingerprint, so the leaderboard can group repeats and report mean±std the
     way ``scripts/dump_attack_arena.py`` already does across seeds. Everything
-    else — dataset, partition, strategy, attack, their params, ``vfl_version``
-    — is identity and participates in the hash.
+    else — dataset, partition, strategy (with its hyperparameters, recorded
+    under the ``strategy_params`` key by the producer so Krum f=2 and f=3 don't
+    collide), attack and its params, ``vfl_version`` — is identity and
+    participates in the hash.
 
     research(2026-05): RFC 8785 (JSON Canonicalization Scheme) → SHA-256 is the
     cross-language standard for content-addressing JSON. This fingerprint is

diff --git a/python/velocity/mcp_app.py b/python/velocity/mcp_app.py
@@ -983,6 +983,23 @@ async def run_real_training(
     )
 
 
+def _real_run_config(strategy: Any, **fields: Any) -> dict[str, Any]:
+    """Run config for a real-training run.
+
+    Records the strategy *name* (for the ``runs.strategy`` column + display) and
+    its hyperparameters under ``strategy_params``, so ``config_fingerprint``
+    resolves runs that differ only in a hyperparameter (Krum f=2 vs f=3) rather
+    than conflating them. ``fields`` are the remaining config entries.
+    """
+    from velocity.strategy import strategy_name, strategy_params
+
+    return {
+        "strategy": strategy_name(strategy),
+        "strategy_params": strategy_params(strategy),
+        **fields,
+    }
+
+
 async def _execute_real_training(
     *,
     user_id: str,
@@ -1103,7 +1120,7 @@ def _run_real_training_sync(
     from velocity import _core
     from velocity.datasets import load_federated
     from velocity.paper_attacks import craft_byzantine_updates
-    from velocity.strategy import FedProx, strategy_name
+    from velocity.strategy import FedProx
     from velocity.training import (
         evaluate,
         layer_shapes,
@@ -1153,21 +1170,21 @@ def make_model() -> nn.Module:
     # strategy keeps the FedAvg-style 0.0 proximal term.
     proximal_mu = strategy.mu if isinstance(strategy, FedProx) else 0.0
 
-    config = {
-        "strategy": strategy_name(strategy),
-        "model_id": model_id,
-        "dataset": dataset,
-        "rounds": rounds,
-        "num_clients": num_clients,
-        "local_epochs": local_epochs,
-        "batch_size": batch_size,
-        "lr": lr,
-        "seed": seed,
-        "partition": partition,
-        "partition_kwargs": partition_kwargs,
-        "attack": attack,
-        "mode": "real",
-    }
+    config = _real_run_config(
+        strategy,
+        model_id=model_id,
+        dataset=dataset,
+        rounds=rounds,
+        num_clients=num_clients,
+        local_epochs=local_epochs,
+        batch_size=batch_size,
+        lr=lr,
+        seed=seed,
+        partition=partition,
+        partition_kwargs=partition_kwargs,
+        attack=attack,
+        mode="real",
+    )
     # num_malicious is part of the attack spec; absent on clean runs so their
     # config fingerprint is unchanged. robustness_delta_leaderboard strips it
     # (with `attack`) to pair an attacked run with its no-attack baseline.

diff --git a/python/velocity/strategy.py b/python/velocity/strategy.py
@@ -208,6 +208,18 @@ def strategy_name(strategy: Strategy) -> str:
     return type(strategy).__name__
 
 
+def strategy_params(strategy: Strategy) -> dict[str, Any]:
+    """Hyperparameters of a strategy instance as a dict — ``{"f": 2}`` for
+    ``Krum(f=2)``, ``{}`` for paramless strategies (FedAvg, FedMedian, ArKrum).
+
+    Recorded under ``config["strategy_params"]`` so a strategy's hyperparameters
+    participate in :func:`velocity.db.config_fingerprint`. Without it, runs that
+    differ only in a hyperparameter (Krum f=2 vs f=3) share a fingerprint and the
+    leaderboard conflates them.
+    """
+    return {f.name: getattr(strategy, f.name) for f in fields(strategy)}
+
+
 def parse_strategy(value: str | dict[str, Any] | Strategy) -> Strategy:
     """Coerce a user-supplied value into a strategy instance.
 

diff --git a/tests/test_mcp_real_training.py b/tests/test_mcp_real_training.py
@@ -393,3 +393,21 @@ def mk(state: dict) -> Any:
         )
         assert isinstance(updates[0], _core.ClientUpdate), attack
         assert updates[0].num_samples == 10, attack
+
+
+def test_real_run_config_records_strategy_params() -> None:
+    """A run config records strategy hyperparameters so the fingerprint resolves them.
+
+    Before this, only the strategy *name* was stored, so Krum f=2 and f=3
+    collapsed to one config_fingerprint and the leaderboard conflated them.
+    """
+    from velocity.strategy import FedAvg, Krum
+
+    c2 = mcp_app._real_run_config(Krum(f=2), dataset="mnist", rounds=5)
+    c3 = mcp_app._real_run_config(Krum(f=3), dataset="mnist", rounds=5)
+    assert c2["strategy"] == "Krum"  # name preserved for the runs.strategy column
+    assert c2["strategy_params"] == {"f": 2}
+    # Distinct hyperparameters → distinct fingerprints (no longer conflated).
+    assert db.config_fingerprint(c2) != db.config_fingerprint(c3)
+    # Paramless strategies carry an empty params dict (fingerprint unchanged).
+    assert mcp_app._real_run_config(FedAvg(), dataset="mnist")["strategy_params"] == {}
diff --git a/tests/test_strategy.py b/tests/test_strategy.py
@@ -19,6 +19,7 @@
     complexity_for,
     parse_strategy,
     strategy_name,
+    strategy_params,
 )
 
 
@@ -188,3 +189,15 @@ def test_complexity_for_accepts_name_and_instance():
 def test_complexity_for_unknown_name_raises():
     with pytest.raises(KeyError):
         complexity_for("FedNope")
+
+
+def test_strategy_params_extracts_dataclass_fields():
+    # Paramless strategies → empty dict (so their config fingerprint is unchanged).
+    assert strategy_params(FedAvg()) == {}
+    assert strategy_params(FedMedian()) == {}
+    assert strategy_params(ArKrum()) == {}
+    # Parameterised strategies → their hyperparameters, so the fingerprint resolves them.
+    assert strategy_params(Krum(f=2)) == {"f": 2}
+    assert strategy_params(FedProx(mu=0.1)) == {"mu": 0.1}
+    assert strategy_params(TrimmedMean(k=1)) == {"k": 1}
+    assert strategy_params(MultiKrum(f=1, m=3)) == {"f": 1, "m": 3}