Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
43 changes: 18 additions & 25 deletions IMPL.md
Original file line number Diff line number Diff line change
@@ -1,32 +1,25 @@
# IMPL: reproducibility loop complete — nothing in flight
# IMPL: strategy-param fingerprint fidelity shipped — `hyperparameter_sage` next

Session-by-session checklist for what's actively in flight. Long-horizon
planning lives in [ROADMAP.md](ROADMAP.md).
Session-by-session checklist. Long-horizon in [ROADMAP.md](ROADMAP.md).

## In flight

_Nothing open._ The reproducibility loop is complete (both shipped 2026-06-01,
ROADMAP → Completed):

- **`velocity archive <out-dir>`** — package a sweep output into a single-file
RO-Crate (Process Run Crate, `.zip`): the sweep artifacts plus a `uv.lock`
snapshot, a how-to-reproduce README, and a spec-conformant
`ro-crate-metadata.json`. Zero new dependency (hand-rolled JSON-LD).
- **`velocity reproduce <archive.zip> [--check]`** — the inverse: recover the
per-run `RunSpec`s from the crate, re-run via `run_sweep`, and (with `--check`)
verify each run's final loss against the archived value within a relative
tolerance (not bit-exact; nan-safe). Exits non-zero on a real mismatch.

Both are CLI-only (no MCP-contract churn), built on the existing `sweep`
machinery (DRY), and live in `velocity.archive`.

## Next up (queued, not active)

See ROADMAP. The clean CLI-only runway around the sweep/archive area is tapped;
what remains needs design or research judgment rather than execution — the
leaderboard read-side items (cross-config normalisation, the LLM-backed A2A
auditors: convergence / robustness / hyperparameter) and the research-tier
streams (server-side DP, streaming aggregation, compression).
_Nothing open._ **Strategy hyperparameters now participate in the run
fingerprint** (shipped 2026-06-01, ROADMAP → Completed). `strategy.strategy_params()`
+ the extracted `mcp_app._real_run_config` record a run's hyperparameters under a
`strategy_params` config key, so Krum f=2 and f=3 are no longer collapsed into one
leaderboard row. Fixed a latent per-fingerprint-board correctness gap (the
producers stored only the strategy name) and unblocked `hyperparameter_sage`.

## Next pickup

- **`hyperparameter_sage`** (ROADMAP → A2A specialists) — now data-unblocked. An
MCP tool that, given a target config, ranks the top-k hyperparameter values
observed in *matched* runs (mean±std accuracy, sample count), hard-failing below
a sample threshold (start: 10) per the Sage guard-rails. The guard-rails are
statistically grounded (variance in ML benchmarks, arXiv:2103.03098). Remaining
is a design pass on the precise "matched runs" semantics + the threshold — worth
AJ's eye, since the recommendation semantics are a research-judgment call.

When picking one up, replace this file with a full session plan
(Why / Decisions / Scope / Out of scope / Definition of done).
19 changes: 18 additions & 1 deletion ROADMAP.md
Original file line number Diff line number Diff line change
Expand Up @@ -319,7 +319,10 @@ rather than the curated, dumped arena CSV the first cut renders.
**shipped 2026-06-01**, see Completed), `hyperparameter_sage` (given a target config, returns the
top-k α / μ / f values observed in matched runs, with sample
count + variance, and flags when sample size is too low to
recommend).
recommend). `hyperparameter_sage`'s **data prerequisite is now met** — strategy
hyperparameters participate in the config fingerprint as of 2026-06-01 (see
Completed), so runs differing only in α / μ / f are distinguishable; what
remains is the design pass on what "matched" means + the sample thresholds.
- **Sage guard-rails** — any sage answer must quote sample size and
variance. "α=0.3 was top-3 over 47 runs on MNIST+shard+no-attack,
IQR ±0.008 final accuracy" is useful; "use α=0.3" is cargo cult.
Expand Down Expand Up @@ -441,6 +444,20 @@ a dash is illegal. Only display/brand prose is "Velocity-FL".

Authoritative records: git history, `docs/benchmarks.md`, `docs/convergence.md`, `docs/strategies.md`. This index is pruned once work is durably shipped.

- 2026-06-01 — **Strategy hyperparameters now in the run fingerprint (leaderboard
fidelity; unblocks `hyperparameter_sage`).** Both DB producers stored only the
strategy *name* (`strategy_name(strat)`), so a strategy's hyperparameters (Krum
`f`, FedProx `mu`, TrimmedMean `k`, …) never reached the config — and
`config_fingerprint`, despite its docstring, couldn't distinguish them. Krum f=2
and f=3 shared a fingerprint, so every per-fingerprint board silently averaged
*different* experiments into one row. Added `strategy.strategy_params(strat)`
(dataclass fields → dict; `{}` for paramless strategies, so their fingerprints are
unchanged) and extracted `mcp_app._real_run_config`, which records both the name
(for the `runs.strategy` column) and `strategy_params` so the existing
canonical-JSON fingerprint resolves them. Only `run_real_training` carries
parameterised strategies (`run_demo`'s string input is paramless-only — no bug, no
change). No MCP / CLI surface change. TDD (helper + distinct-fingerprint guard);
full suite + lint green. Unblocks the roadmapped `hyperparameter_sage`.
- 2026-06-01 — **`velocity reproduce` — re-run an archived sweep (closes the loop).**
The inverse of `velocity archive`: `read_archive` recovers the per-run `RunSpec`s
(and the original `comparison.json`) straight from the crate zip; `reproduce_archive`
Expand Down
6 changes: 4 additions & 2 deletions python/velocity/db.py
Original file line number Diff line number Diff line change
Expand Up @@ -224,8 +224,10 @@ def config_fingerprint(config: dict[str, Any]) -> str:
Two runs that differ only in ``seed`` (or per-commit ``git_sha``) share a
fingerprint, so the leaderboard can group repeats and report mean±std the
way ``scripts/dump_attack_arena.py`` already does across seeds. Everything
else — dataset, partition, strategy, attack, their params, ``vfl_version``
— is identity and participates in the hash.
else — dataset, partition, strategy (with its hyperparameters, recorded
under the ``strategy_params`` key by the producer so Krum f=2 and f=3 don't
collide), attack and its params, ``vfl_version`` — is identity and
participates in the hash.

research(2026-05): RFC 8785 (JSON Canonicalization Scheme) → SHA-256 is the
cross-language standard for content-addressing JSON. This fingerprint is
Expand Down
49 changes: 33 additions & 16 deletions python/velocity/mcp_app.py
Original file line number Diff line number Diff line change
Expand Up @@ -983,6 +983,23 @@ async def run_real_training(
)


def _real_run_config(strategy: Any, **fields: Any) -> dict[str, Any]:
"""Run config for a real-training run.

Records the strategy *name* (for the ``runs.strategy`` column + display) and
its hyperparameters under ``strategy_params``, so ``config_fingerprint``
resolves runs that differ only in a hyperparameter (Krum f=2 vs f=3) rather
than conflating them. ``fields`` are the remaining config entries.
"""
from velocity.strategy import strategy_name, strategy_params

return {
"strategy": strategy_name(strategy),
"strategy_params": strategy_params(strategy),
**fields,
}


async def _execute_real_training(
*,
user_id: str,
Expand Down Expand Up @@ -1103,7 +1120,7 @@ def _run_real_training_sync(
from velocity import _core
from velocity.datasets import load_federated
from velocity.paper_attacks import craft_byzantine_updates
from velocity.strategy import FedProx, strategy_name
from velocity.strategy import FedProx
from velocity.training import (
evaluate,
layer_shapes,
Expand Down Expand Up @@ -1153,21 +1170,21 @@ def make_model() -> nn.Module:
# strategy keeps the FedAvg-style 0.0 proximal term.
proximal_mu = strategy.mu if isinstance(strategy, FedProx) else 0.0

config = {
"strategy": strategy_name(strategy),
"model_id": model_id,
"dataset": dataset,
"rounds": rounds,
"num_clients": num_clients,
"local_epochs": local_epochs,
"batch_size": batch_size,
"lr": lr,
"seed": seed,
"partition": partition,
"partition_kwargs": partition_kwargs,
"attack": attack,
"mode": "real",
}
config = _real_run_config(
strategy,
model_id=model_id,
dataset=dataset,
rounds=rounds,
num_clients=num_clients,
local_epochs=local_epochs,
batch_size=batch_size,
lr=lr,
seed=seed,
partition=partition,
partition_kwargs=partition_kwargs,
attack=attack,
mode="real",
)
# num_malicious is part of the attack spec; absent on clean runs so their
# config fingerprint is unchanged. robustness_delta_leaderboard strips it
# (with `attack`) to pair an attacked run with its no-attack baseline.
Expand Down
12 changes: 12 additions & 0 deletions python/velocity/strategy.py
Original file line number Diff line number Diff line change
Expand Up @@ -208,6 +208,18 @@ def strategy_name(strategy: Strategy) -> str:
return type(strategy).__name__


def strategy_params(strategy: Strategy) -> dict[str, Any]:
"""Hyperparameters of a strategy instance as a dict — ``{"f": 2}`` for
``Krum(f=2)``, ``{}`` for paramless strategies (FedAvg, FedMedian, ArKrum).

Recorded under ``config["strategy_params"]`` so a strategy's hyperparameters
participate in :func:`velocity.db.config_fingerprint`. Without it, runs that
differ only in a hyperparameter (Krum f=2 vs f=3) share a fingerprint and the
leaderboard conflates them.
"""
return {f.name: getattr(strategy, f.name) for f in fields(strategy)}


def parse_strategy(value: str | dict[str, Any] | Strategy) -> Strategy:
"""Coerce a user-supplied value into a strategy instance.

Expand Down
18 changes: 18 additions & 0 deletions tests/test_mcp_real_training.py
Original file line number Diff line number Diff line change
Expand Up @@ -393,3 +393,21 @@ def mk(state: dict) -> Any:
)
assert isinstance(updates[0], _core.ClientUpdate), attack
assert updates[0].num_samples == 10, attack


def test_real_run_config_records_strategy_params() -> None:
"""A run config records strategy hyperparameters so the fingerprint resolves them.

Before this, only the strategy *name* was stored, so Krum f=2 and f=3
collapsed to one config_fingerprint and the leaderboard conflated them.
"""
from velocity.strategy import FedAvg, Krum

c2 = mcp_app._real_run_config(Krum(f=2), dataset="mnist", rounds=5)
c3 = mcp_app._real_run_config(Krum(f=3), dataset="mnist", rounds=5)
assert c2["strategy"] == "Krum" # name preserved for the runs.strategy column
assert c2["strategy_params"] == {"f": 2}
# Distinct hyperparameters → distinct fingerprints (no longer conflated).
assert db.config_fingerprint(c2) != db.config_fingerprint(c3)
# Paramless strategies carry an empty params dict (fingerprint unchanged).
assert mcp_app._real_run_config(FedAvg(), dataset="mnist")["strategy_params"] == {}
13 changes: 13 additions & 0 deletions tests/test_strategy.py
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@
complexity_for,
parse_strategy,
strategy_name,
strategy_params,
)


Expand Down Expand Up @@ -188,3 +189,15 @@ def test_complexity_for_accepts_name_and_instance():
def test_complexity_for_unknown_name_raises():
with pytest.raises(KeyError):
complexity_for("FedNope")


def test_strategy_params_extracts_dataclass_fields():
# Paramless strategies → empty dict (so their config fingerprint is unchanged).
assert strategy_params(FedAvg()) == {}
assert strategy_params(FedMedian()) == {}
assert strategy_params(ArKrum()) == {}
# Parameterised strategies → their hyperparameters, so the fingerprint resolves them.
assert strategy_params(Krum(f=2)) == {"f": 2}
assert strategy_params(FedProx(mu=0.1)) == {"mu": 0.1}
assert strategy_params(TrimmedMean(k=1)) == {"k": 1}
assert strategy_params(MultiKrum(f=1, m=3)) == {"f": 1, "m": 3}