Val/test decay tracking in judge scoring #7
Description
Problem
The v2 judge scores experiments based on validation metrics (walk-forward folds, composite score). But validation metrics systematically overpredict out-of-sample performance. In a real deployment, every champion showed 26-77% Sharpe decay from validation to held-out test, and three strategies flipped sign entirely.
The scoring weights don't account for this decay, meaning the composite score is a poor predictor of live performance. A strategy scored at 1.03 on validation may deliver 0.5-0.7 in practice.
Proposal
Add decay tracking to the judge and scoring system.
Track decay ratios per branch
When test-set evaluations are available, compute:
```
decay_ratio = test_metric / val_metric
```
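A minimal sketch of the bookkeeping, assuming a hypothetical helper `update_branch_decay` and a rolling window of ten ratios (neither name nor window size is specified in this proposal). It fills the `val_test_decay_ratios` and `median_decay_ratio` fields of the schema below, leaving `decay_trend` aside:

```python
import statistics

def update_branch_decay(beliefs, branch_name, val_metric, test_metric, window=10):
    """Record the latest val->test decay ratio for a branch and refresh its median."""
    entry = beliefs.setdefault(branch_name, {"val_test_decay_ratios": []})
    ratios = entry["val_test_decay_ratios"]
    ratios.append(round(test_metric / val_metric, 2))
    del ratios[:-window]  # keep only the most recent `window` ratios
    entry["median_decay_ratio"] = statistics.median(ratios)
    return entry

beliefs = {}
update_branch_decay(beliefs, "framing_bias", val_metric=1.03, test_metric=0.46)
update_branch_decay(beliefs, "framing_bias", val_metric=0.98, test_metric=0.51)
```

Persisting `beliefs` back to `branch_beliefs.json` is a plain `json.dump`; the rounding and window size here are illustrative choices, not part of the proposal.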
Store rolling decay ratios in `branch_beliefs.json`:

```json
{
  "branch_name": {
    "val_test_decay_ratios": [0.45, 0.52, 0.38],
    "median_decay_ratio": 0.45,
    "decay_trend": "stable"
  }
}
```

Decay-adjusted scoring
When computing the composite score, apply the branch's median decay ratio:
```python
def compute_decay_adjusted_score(val_composite, decay_ratio):
    # If median decay is 0.5 (test is half of val), discount by 0.5
    return val_composite * max(0.3, decay_ratio)  # floor at 0.3 to avoid zeroing out
```

Overfitting flag
When a branch's median decay ratio drops below 0.3 (test is less than 30% of validation):
- Flag the branch as `overfitting_prone: true` in beliefs
- Require test-set confirmation before promotion (run the experiment on held-out data, not just walk-forward)
- Log in handoff: "Branch {X} has median val/test decay of {Y}. Promotions on this branch require test-set confirmation."
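The flag-and-log step above can be sketched as follows (the `check_overfitting` name and the threshold constant are assumptions; the beliefs entry mirrors the schema sketched earlier):

```python
OVERFIT_THRESHOLD = 0.3  # assumed constant: "drops below 0.3" from the proposal

def check_overfitting(branch_name, entry):
    """Flag a branch whose median val/test decay falls below the threshold.

    `entry` is one branch record from branch_beliefs.json. Returns the
    handoff log message when the branch is flagged, else None.
    """
    median = entry.get("median_decay_ratio")
    if median is not None and median < OVERFIT_THRESHOLD:
        entry["overfitting_prone"] = True
        return (f"Branch {branch_name} has median val/test decay of {median:.2f}. "
                "Promotions on this branch require test-set confirmation.")
    return None
```

The caller (the judge's promotion path, in this sketch) would refuse to promote a flagged branch until a held-out test run confirms the result.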
Lab-wide decay tracking
Track aggregate decay across ALL branches:
```json
{
  "lab_decay": {
    "median_all_branches": 0.52,
    "worst_branch": {"name": "framing_bias", "decay": 0.23},
    "best_branch": {"name": "competition", "decay": 0.92}
  }
}
```

Surface this table in human checkpoint reports.
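The aggregate above can be derived from the per-branch beliefs; a sketch, assuming a hypothetical `summarize_lab_decay` helper over branch records shaped like the earlier schema:

```python
import statistics

def summarize_lab_decay(beliefs):
    """Aggregate median decay ratios across all branches that have data."""
    medians = {name: b["median_decay_ratio"]
               for name, b in beliefs.items()
               if "median_decay_ratio" in b}
    if not medians:
        return None  # no test-set evaluations recorded yet
    worst = min(medians, key=medians.get)  # lowest ratio = worst decay
    best = max(medians, key=medians.get)
    return {
        "lab_decay": {
            "median_all_branches": statistics.median(medians.values()),
            "worst_branch": {"name": worst, "decay": medians[worst]},
            "best_branch": {"name": best, "decay": medians[best]},
        }
    }
```

The returned dict matches the JSON shape above, so it can be written straight into the checkpoint report.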
Why this matters
A judge that doesn't know its own accuracy is unreliable. If the lab observes that val scores consistently overpredict test by 2x, the scoring weights should adapt. Without this, the framework keeps promoting strategies that look great on validation and underperform on test -- which is the definition of overfitting to the validation procedure.
Relationship to existing features
- Extends the judge's effect_size and information_gain gates with a temporal dimension
- Feeds into the frame_challenge: "our scoring metric doesn't predict live performance" is a frame invalidation
- Works with diagnostics: a "val_test_comparison" diagnostic would compute decay ratios
- The human checkpoint report should include the lab-wide decay table