Skip to content

fix(leaderboard): record strategy hyperparameters in the run fingerprint#82

Merged
ajbarea merged 1 commit into
mainfrom
feat/strategy-params-fingerprint
Jun 1, 2026
Merged

fix(leaderboard): record strategy hyperparameters in the run fingerprint#82
ajbarea merged 1 commit into
mainfrom
feat/strategy-params-fingerprint

Conversation

@ajbarea
Copy link
Copy Markdown
Owner

@ajbarea ajbarea commented Jun 1, 2026

The bug (latent leaderboard-correctness gap)

Both DB producers built the run config with config["strategy"] = strategy_name(strat) — only the strategy name. So a strategy's hyperparameters (Krum f, FedProx mu, TrimmedMean k, MultiKrum/Bulyan f/m, …) were never recorded. config_fingerprint's docstring claims "strategy … their params … participates in the hash", but they couldn't — they weren't stored.

Consequence: Krum f=2 and f=3 share a fingerprint, so every per-fingerprint board (accuracy, wall-clock, comm-cost, pareto, robustness) silently averages different experiments into one row. It also left hyperparameter_sage (roadmapped) with no hyperparameter data to mine.

Found while picking up hyperparameter_sage — verifying the data foundation before building on it.

The fix

  • strategy.strategy_params(strat) -> dict (mirrors strategy_name; dataclass fields → dict; {} for paramless FedAvg/FedMedian/ArKrum, so their fingerprints are unchanged — no churn).
  • Extract mcp_app._real_run_config(strategy, **fields) from the heavy _execute_real_training, recording both the name (for the runs.strategy column + display) and strategy_params. The existing canonical-JSON config_fingerprint then resolves params with no fingerprint-logic change. Extracting it also makes the fix unit-testable without running real training.
  • Only run_real_training is touched — run_demo takes a strategy string and parse_strategy can't parameterise a bare name (Krum needs f → falls back to FedAvg), so it's paramless-only: no bug, no change (YAGNI).
  • Corrected the config_fingerprint docstring to state the mechanism.

No MCP tool-signature or CLI change → no surface-hash re-pin, no docs catalog churn.

Testing

  • TDD: strategy_params over every strategy shape; _real_run_config records params and Krum f=2 ≠ f=3 fingerprints (the regression guard for exactly this bug).
  • Whole-repo make lint (ruff + ty) green; full suite 381 passed, 9 skipped.

Research

The Sage guard-rails this unblocks (quote sample size + variance, hard-fail on too-few seeds) are statistically grounded: hyperparameter-choice variance markedly affects ML benchmark results (arXiv:2103.03098). hyperparameter_sage itself stays on the ROADMAP — its "matched runs" semantics + thresholds are a design pass worth a human eye.

The DB producers stored only the strategy name (strategy_name), so a strategy's
hyperparameters (Krum f, FedProx mu, TrimmedMean k, ...) never reached the run
config. config_fingerprint — despite its docstring — couldn't distinguish them,
so Krum f=2 and f=3 shared a fingerprint and every per-fingerprint leaderboard
silently averaged different experiments into one row.

Add strategy.strategy_params() (dataclass fields -> dict; {} for paramless
strategies, so their fingerprints are unchanged) and extract
mcp_app._real_run_config, which records both the name (for the runs.strategy
column) and strategy_params so the existing canonical-JSON fingerprint resolves
them. Only run_real_training carries parameterised strategies (run_demo's string
input is paramless-only, so it has no bug and is unchanged). No MCP/CLI surface
change.

Unblocks the roadmapped hyperparameter_sage.
@ajbarea ajbarea enabled auto-merge (squash) June 1, 2026 13:34
@ajbarea ajbarea merged commit 6f26cc3 into main Jun 1, 2026
6 checks passed
@ajbarea ajbarea deleted the feat/strategy-params-fingerprint branch June 1, 2026 13:36
@codspeed-hq
Copy link
Copy Markdown
Contributor

codspeed-hq Bot commented Jun 1, 2026

Merging this PR will not alter performance

✅ 41 untouched benchmarks
⏩ 10 skipped benchmarks1


Comparing feat/strategy-params-fingerprint (ac0c279) with main (eec467b)

Open in CodSpeed

Footnotes

  1. 10 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant