fix(leaderboard): record strategy hyperparameters in the run fingerprint#82
Merged
Conversation
The DB producers stored only the strategy name (strategy_name), so a strategy's
hyperparameters (Krum f, FedProx mu, TrimmedMean k, ...) never reached the run
config. config_fingerprint — despite its docstring — couldn't distinguish them,
so Krum f=2 and f=3 shared a fingerprint and every per-fingerprint leaderboard
silently averaged different experiments into one row.
Add strategy.strategy_params() (dataclass fields -> dict; {} for paramless
strategies, so their fingerprints are unchanged) and extract
mcp_app._real_run_config, which records both the name (for the runs.strategy
column) and strategy_params so the existing canonical-JSON fingerprint resolves
them. Only run_real_training carries parameterised strategies (run_demo's string
input is paramless-only, so it has no bug and is unchanged). No MCP/CLI surface
change.
Unblocks the roadmapped hyperparameter_sage.
Contributor
Merging this PR will not alter performance
Comparing Footnotes
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The bug (latent leaderboard-correctness gap)
Both DB producers built the run config with
config["strategy"] = strategy_name(strat)— only the strategy name. So a strategy's hyperparameters (Krumf, FedProxmu, TrimmedMeank, MultiKrum/Bulyanf/m, …) were never recorded.config_fingerprint's docstring claims "strategy … their params … participates in the hash", but they couldn't — they weren't stored.Consequence: Krum f=2 and f=3 share a fingerprint, so every per-fingerprint board (accuracy, wall-clock, comm-cost, pareto, robustness) silently averages different experiments into one row. It also left
hyperparameter_sage(roadmapped) with no hyperparameter data to mine.Found while picking up
hyperparameter_sage— verifying the data foundation before building on it.The fix
strategy.strategy_params(strat) -> dict(mirrorsstrategy_name; dataclass fields → dict;{}for paramless FedAvg/FedMedian/ArKrum, so their fingerprints are unchanged — no churn).mcp_app._real_run_config(strategy, **fields)from the heavy_execute_real_training, recording both the name (for theruns.strategycolumn + display) andstrategy_params. The existing canonical-JSONconfig_fingerprintthen resolves params with no fingerprint-logic change. Extracting it also makes the fix unit-testable without running real training.run_real_trainingis touched —run_demotakes a strategy string andparse_strategycan't parameterise a bare name (Krum needsf→ falls back to FedAvg), so it's paramless-only: no bug, no change (YAGNI).config_fingerprintdocstring to state the mechanism.No MCP tool-signature or CLI change → no surface-hash re-pin, no docs catalog churn.
Testing
strategy_paramsover every strategy shape;_real_run_configrecords params and Krum f=2 ≠ f=3 fingerprints (the regression guard for exactly this bug).make lint(ruff + ty) green; full suite 381 passed, 9 skipped.Research
The Sage guard-rails this unblocks (quote sample size + variance, hard-fail on too-few seeds) are statistically grounded: hyperparameter-choice variance markedly affects ML benchmark results (arXiv:2103.03098).
hyperparameter_sageitself stays on the ROADMAP — its "matched runs" semantics + thresholds are a design pass worth a human eye.