Val/test decay tracking in judge scoring #7

@poofeth

Description

Problem

The v2 judge scores experiments based on validation metrics (walk-forward folds, composite score). But validation metrics systematically overpredict out-of-sample performance. In a real deployment, every champion showed 26-77% Sharpe decay from validation to held-out test, and three strategies flipped sign entirely.

The scoring weights don't account for this decay, so the composite score is a poor predictor of live performance: a strategy scored at 1.03 on validation may deliver only 0.5-0.7 in practice.

Proposal

Add decay tracking to the judge and scoring system.

Track decay ratios per branch

When test-set evaluations are available, compute:

```python
decay_ratio = test_metric / val_metric
```

Store rolling decay ratios in branch_beliefs.json:

```json
{
  "branch_name": {
    "val_test_decay_ratios": [0.45, 0.52, 0.38],
    "median_decay_ratio": 0.45,
    "decay_trend": "stable"
  }
}
```
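A minimal sketch of the recording step, assuming the `branch_beliefs.json` path and field names from the schema above; `record_decay_ratio` and the rolling-window size are hypothetical:

```python
import json
import statistics
from pathlib import Path

BELIEFS_PATH = Path("branch_beliefs.json")  # path from the issue; adjust to the lab's layout

def record_decay_ratio(branch_name: str, val_metric: float, test_metric: float,
                       window: int = 10) -> float:
    """Append a val/test decay ratio for a branch and refresh its median.

    Hypothetical helper; field names follow the schema proposed above.
    """
    beliefs = json.loads(BELIEFS_PATH.read_text()) if BELIEFS_PATH.exists() else {}
    branch = beliefs.setdefault(branch_name, {"val_test_decay_ratios": []})
    ratio = test_metric / val_metric if val_metric else 0.0
    ratios = branch["val_test_decay_ratios"]
    ratios.append(round(ratio, 4))
    del ratios[:-window]  # keep only a rolling window of recent ratios
    branch["median_decay_ratio"] = round(statistics.median(ratios), 4)
    BELIEFS_PATH.write_text(json.dumps(beliefs, indent=2))
    return ratio
```

Note that a sign-flipped strategy (negative test metric against a positive val metric) produces a negative ratio, which the scoring floor below handles.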

Decay-adjusted scoring

When computing the composite score, apply the branch's median decay ratio:

```python
def compute_decay_adjusted_score(val_composite, decay_ratio):
    # If median decay is 0.5 (test is half of val), discount by 0.5
    return val_composite * max(0.3, decay_ratio)  # floor at 0.3 to avoid zeroing out
```
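A quick check of the behavior, using the 1.03 val composite from the problem statement (definition repeated so the snippet runs standalone):

```python
def compute_decay_adjusted_score(val_composite, decay_ratio):
    # Definition from above, repeated so this snippet runs standalone.
    return val_composite * max(0.3, decay_ratio)  # floor at 0.3

# With a 0.5 median decay, the 1.03 composite is discounted to roughly test level:
print(round(compute_decay_adjusted_score(1.03, 0.5), 3))   # 0.515

# A badly decaying branch (or one whose ratio went negative after a sign flip)
# still gets the 0.3 floor rather than a zero or negative score:
print(round(compute_decay_adjusted_score(1.03, -0.2), 3))  # 0.309
```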

Overfitting flag

When a branch's median decay ratio drops below 0.3 (test is less than 30% of validation):

  • Flag the branch as overfitting_prone: true in beliefs
  • Require test-set confirmation before promotion (run the experiment on held-out data, not just walk-forward folds)
  • Log in handoff: "Branch {X} has median val/test decay of {Y}. Promotions on this branch require test-set confirmation."
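The three bullets above could be wired up as follows; this is a sketch, with `apply_overfitting_flag` and the `requires_test_confirmation` field being hypothetical names:

```python
OVERFIT_THRESHOLD = 0.3  # median decay below this flags the branch

def apply_overfitting_flag(beliefs: dict, branch_name: str) -> list:
    """Flag an overfitting-prone branch and return handoff log lines.

    Hypothetical helper; `beliefs` follows the branch_beliefs.json schema above.
    """
    branch = beliefs[branch_name]
    logs = []
    if branch.get("median_decay_ratio", 1.0) < OVERFIT_THRESHOLD:
        branch["overfitting_prone"] = True
        branch["requires_test_confirmation"] = True  # assumed field name
        logs.append(
            f"Branch {branch_name} has median val/test decay of "
            f"{branch['median_decay_ratio']}. Promotions on this branch "
            "require test-set confirmation."
        )
    return logs
```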

Lab-wide decay tracking

Track aggregate decay across ALL branches:

```json
{
  "lab_decay": {
    "median_all_branches": 0.52,
    "worst_branch": {"name": "framing_bias", "decay": 0.23},
    "best_branch": {"name": "competition", "decay": 0.92}
  }
}
```
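A minimal aggregation sketch that produces the summary above from per-branch beliefs; `compute_lab_decay` is a hypothetical helper:

```python
import statistics

def compute_lab_decay(beliefs: dict) -> dict:
    """Aggregate per-branch median decay ratios into the lab_decay summary.

    Hypothetical helper; output shape matches the lab_decay schema above.
    """
    medians = {name: b["median_decay_ratio"]
               for name, b in beliefs.items()
               if "median_decay_ratio" in b}
    if not medians:
        return {}
    worst = min(medians, key=medians.get)
    best = max(medians, key=medians.get)
    return {
        "lab_decay": {
            "median_all_branches": round(statistics.median(medians.values()), 2),
            "worst_branch": {"name": worst, "decay": medians[worst]},
            "best_branch": {"name": best, "decay": medians[best]},
        }
    }
```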

Surface in human checkpoint reports.

Why this matters

A judge that doesn't know its own accuracy is unreliable. If the lab observes that val scores consistently overpredict test by 2x, the scoring weights should adapt. Without this, the framework keeps promoting strategies that look great on validation and underperform on test -- which is the definition of overfitting to the validation procedure.

Relationship to existing features

  • Extends the judge's effect_size and information_gain gates with a temporal dimension
  • Feeds into the frame_challenge: "our scoring metric doesn't predict live performance" is a frame invalidation
  • Works with diagnostics: a "val_test_comparison" diagnostic would compute decay ratios
  • The human checkpoint report should include the lab-wide decay table
