Val/test decay tracking in judge scoring #7

@poofeth

Description

Problem

The v2 judge scores experiments based on validation metrics (walk-forward folds, composite score). But validation metrics systematically overpredict out-of-sample performance. In a real deployment, every champion showed 26-77% Sharpe decay from validation to held-out test, and three strategies flipped sign entirely.

The scoring weights don't account for this decay, so the composite score is a poor predictor of live performance: a strategy scored at 1.03 on validation may deliver only 0.5-0.7 in practice.

Proposal

Add decay tracking to the judge and scoring system.

Track decay ratios per branch

When test-set evaluations are available, compute:

```python
decay_ratio = test_metric / val_metric
```

Store rolling decay ratios in branch_beliefs.json:

```json
{
  "branch_name": {
    "val_test_decay_ratios": [0.45, 0.52, 0.38],
    "median_decay_ratio": 0.45,
    "decay_trend": "stable"
  }
}
```
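A minimal sketch of the recording step, assuming the `branch_beliefs.json` path and field names from the schema above; `record_decay_ratio` and the rolling-window size are hypothetical:

```python
import json
import statistics
from pathlib import Path

BELIEFS_PATH = Path("branch_beliefs.json")  # path from the issue; adjust to the lab's layout

def record_decay_ratio(branch_name: str, val_metric: float, test_metric: float,
                       window: int = 10) -> float:
    """Append a val/test decay ratio for a branch and refresh its median.

    Hypothetical helper; field names follow the schema proposed above.
    """
    beliefs = json.loads(BELIEFS_PATH.read_text()) if BELIEFS_PATH.exists() else {}
    branch = beliefs.setdefault(branch_name, {"val_test_decay_ratios": []})
    ratio = test_metric / val_metric if val_metric else 0.0
    ratios = branch["val_test_decay_ratios"]
    ratios.append(round(ratio, 4))
    del ratios[:-window]  # keep only a rolling window of recent ratios
    branch["median_decay_ratio"] = round(statistics.median(ratios), 4)
    BELIEFS_PATH.write_text(json.dumps(beliefs, indent=2))
    return ratio
```

Note that a sign-flipped strategy (negative test metric against a positive val metric) produces a negative ratio, which the scoring floor below handles.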

Decay-adjusted scoring

When computing the composite score, apply the branch's median decay ratio:

```python
def compute_decay_adjusted_score(val_composite, decay_ratio):
    # If median decay is 0.5 (test is half of val), discount by 0.5
    return val_composite * max(0.3, decay_ratio)  # floor at 0.3 to avoid zeroing out
```
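A quick check of the behavior, using the 1.03 val composite from the problem statement (definition repeated so the snippet runs standalone):

```python
def compute_decay_adjusted_score(val_composite, decay_ratio):
    # Definition from above, repeated so this snippet runs standalone.
    return val_composite * max(0.3, decay_ratio)  # floor at 0.3

# With a 0.5 median decay, the 1.03 composite is discounted to roughly test level:
print(round(compute_decay_adjusted_score(1.03, 0.5), 3))   # 0.515

# A badly decaying branch (or one whose ratio went negative after a sign flip)
# still gets the 0.3 floor rather than a zero or negative score:
print(round(compute_decay_adjusted_score(1.03, -0.2), 3))  # 0.309
```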

Overfitting flag

When a branch's median decay ratio drops below 0.3 (test is less than 30% of validation):

  • Flag the branch as overfitting_prone: true in beliefs
  • Require test-set confirmation before promotion (run the experiment on held-out data, not just walk-forward folds)
  • Log in handoff: "Branch {X} has median val/test decay of {Y}. Promotions on this branch require test-set confirmation."
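The three bullets above could be wired up as follows; this is a sketch, with `apply_overfitting_flag` and the `requires_test_confirmation` field being hypothetical names:

```python
OVERFIT_THRESHOLD = 0.3  # median decay below this flags the branch

def apply_overfitting_flag(beliefs: dict, branch_name: str) -> list:
    """Flag an overfitting-prone branch and return handoff log lines.

    Hypothetical helper; `beliefs` follows the branch_beliefs.json schema above.
    """
    branch = beliefs[branch_name]
    logs = []
    if branch.get("median_decay_ratio", 1.0) < OVERFIT_THRESHOLD:
        branch["overfitting_prone"] = True
        branch["requires_test_confirmation"] = True  # assumed field name
        logs.append(
            f"Branch {branch_name} has median val/test decay of "
            f"{branch['median_decay_ratio']}. Promotions on this branch "
            "require test-set confirmation."
        )
    return logs
```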

Lab-wide decay tracking

Track aggregate decay across ALL branches:

```json
{
  "lab_decay": {
    "median_all_branches": 0.52,
    "worst_branch": {"name": "framing_bias", "decay": 0.23},
    "best_branch": {"name": "competition", "decay": 0.92}
  }
}
```
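A minimal aggregation sketch that produces the summary above from per-branch beliefs; `compute_lab_decay` is a hypothetical helper:

```python
import statistics

def compute_lab_decay(beliefs: dict) -> dict:
    """Aggregate per-branch median decay ratios into the lab_decay summary.

    Hypothetical helper; output shape matches the lab_decay schema above.
    """
    medians = {name: b["median_decay_ratio"]
               for name, b in beliefs.items()
               if "median_decay_ratio" in b}
    if not medians:
        return {}
    worst = min(medians, key=medians.get)
    best = max(medians, key=medians.get)
    return {
        "lab_decay": {
            "median_all_branches": round(statistics.median(medians.values()), 2),
            "worst_branch": {"name": worst, "decay": medians[worst]},
            "best_branch": {"name": best, "decay": medians[best]},
        }
    }
```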

Surface in human checkpoint reports.

Why this matters

A judge that doesn't know its own accuracy is unreliable. If the lab observes that val scores consistently overpredict test by 2x, the scoring weights should adapt. Without this, the framework keeps promoting strategies that look great on validation and underperform on test -- which is the definition of overfitting to the validation procedure.

Relationship to existing features

  • Extends the judge's effect_size and information_gain gates with a temporal dimension
  • Feeds into the frame_challenge: "our scoring metric doesn't predict live performance" is a frame invalidation
  • Works with diagnostics: a "val_test_comparison" diagnostic would compute decay ratios
  • The human checkpoint report should include the lab-wide decay table
