fix(research-gbm): tighten 21d-horizon overfitting + formalize observe rail #180
Merged
…e rail

ROADMAP L1816 P1 (promoted from P2 2026-05-15; "ship now anyway" direction 2026-05-19 despite the entry's own >=30-week-corpus advisory; Brian explicitly opted in at 22 weeks). Bounded-risk fix: the Ridge regularization absorbs today's overfit (`research_calibrator_prob` std-coef only +0.05; meta_IC driven by `sector_macro_modifier` + `research_composite_score`), so this is an integrity tightening, NOT a bleeding-alpha fix.

## Evidence the fix targets

The 2026-05-09 promote-true retrain at the new 21d horizon (post Track A PR 4/5 canonical-alpha cutover) showed:

`train_ic=0.4328 / val_ic=0.0857 ⇒ ratio 5.05 on 496 rows`

The threshold cited in alpha-engine-predictor PR #113's manifest spec is >2×; 5.05× is decisive overfit. Cause: the 21d horizon shrinks the labeled fit set vs the pre-cutover 5d horizon (fewer rows have finite `actual_fwd` labels), and `num_leaves=15 / max_depth=4` was sized for the larger-corpus 5d regime.

## What ships

1. **`model/research_gbm.py::_default_params`**: `num_leaves` 15→8 + `max_depth` 4→3. Caps the LightGBM hypothesis space below the row count of a 21d-horizon corpus. Docstring carries the ROADMAP citation + the bounded-risk rationale. Other params unchanged (lambda_l1/l2, min_child_samples, learning_rate kept; they weren't the overfit lever).
2. **`training/meta_trainer.py`**: formalize the observe rail:
   - `_compute_overfit_signal(train_ic, val_ic, warn_threshold=3.0)` returns `(ratio, warn)`. Magnitude-based denominator so a negative val_ic still measures train-fit dominance. Returns `(None, False)` when train_ic is missing OR |val_ic| < 1e-3 (noise floor), which surfaces "couldn't measure" as a distinct state from "measured and healthy".
   - `_emit_research_gbm_overfit_metrics(...)` dispatches CloudWatch gauges (namespace `AlphaEngine/Predictor`): `research_gbm_train_ic`, `research_gbm_val_ic`, `research_gbm_train_val_ic_ratio`, `research_gbm_overfit_warn`. Best-effort emit (mirrors the existing `_emit_research_join_coverage_metrics` pattern); a CW failure WARNs + continues training.
   - The Step 6c block now computes the ratio + warn after `train_ic` and: (a) WARN-logs when the flag fires (loud, immediate signal); (b) emits the CW gauges; (c) carries the ratio + warn into the manifest's `research_gbm` block alongside `train_ic` + `val_ic`.
   - Threshold 3.0 chosen to sit between the 2× watch level and the 5.05× we observed, so the operator is alerted only on confirmed regression.
   - 2-cycle persistence alarm is satisfied by setting up the CW alarm on `research_gbm_overfit_warn > 0` with `DatapointsToAlarm=2 / EvaluationPeriods=2` (weekly cadence); no code change needed once the gauges exist.
3. **Tests** (+13 net):
   - `tests/test_research_gbm_scorer.py`: `test_default_params_overfit_tightening_2026_05_19` pins 8/3 vs a future revert.
   - `tests/test_research_gbm_overfit_signal.py` (new, 13 tests): ratio formula (incl. the historical 5.05 case), threshold boundary (strict >), None contracts (train_ic absent / val_ic noise-floor), negative-val_ic abs handling, custom threshold override, CW emit contract (full metric set when warn / 0-warn when healthy / no emit when can't-measure / defensive train_ic absent / best-effort on CloudWatch failure).

## What does NOT ship

- **Per-fold WF GBM training**: the structurally clean remediation (currently the Step 6c fit is a single 80/20 temporal split). At 22 weeks of labeled history, per-fold rows fall below `min_child_samples=30`; the entry's own caveat at lines 1347-1353 documents this. Re-promoted to the entry's "do at >=30 weeks (~2026-06-20)" milestone.
- **Manifest schema change beyond additive fields**: `train_val_ic_ratio` + `overfit_warn` are additive only. Pre-existing consumers see identical fields; downstream `feature_drift` / dashboard reads ignore unknown keys (no contract break).

## Tests

`pytest tests/ -q` -> 1110 passed, 0 failed

Composes with: alpha-engine-predictor PR #113 (manifest train/val IC emit substrate), PR #114 (canonical-alpha cutover that surfaced the overfit by shrinking the fit set), feedback_component_baseline_validation (the L1-component subsample gate already protects deploys from a degenerate research_gbm; this entry adds observe-rail visibility on top).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
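The ratio/warn contract described for `_compute_overfit_signal` can be illustrated with a minimal sketch. This is not the code from `training/meta_trainer.py`: the function name `compute_overfit_signal_sketch`, the `NOISE_FLOOR` constant name, the choice to leave the numerator signed, and the handling of a missing `val_ic` are assumptions made for illustration. Only the magnitude-based denominator, the 1e-3 noise floor, the strict `>` comparison, and the `(None, False)` "couldn't measure" return come from the description above.

```python
from typing import Optional, Tuple

# Assumed name for the 1e-3 noise floor mentioned in the commit message.
NOISE_FLOOR = 1e-3


def compute_overfit_signal_sketch(
    train_ic: Optional[float],
    val_ic: Optional[float],
    warn_threshold: float = 3.0,
) -> Tuple[Optional[float], bool]:
    """Return (ratio, warn), or (None, False) when the ratio cannot be measured."""
    # "Couldn't measure": train_ic is missing. Treating a missing val_ic the
    # same way is an assumption of this sketch.
    if train_ic is None or val_ic is None:
        return None, False
    # "Couldn't measure": val_ic is inside the noise floor, so the ratio
    # would be dominated by noise in the denominator.
    if abs(val_ic) < NOISE_FLOOR:
        return None, False
    # Magnitude-based denominator: a negative val_ic still yields a positive
    # measure of how much train fit dominates validation fit.
    ratio = train_ic / abs(val_ic)
    # Strict comparison: a ratio exactly at the threshold does not warn.
    return ratio, ratio > warn_threshold


if __name__ == "__main__":
    # The historical 2026-05-09 retrain: ratio of about 5.05, warn fires.
    print(compute_overfit_signal_sketch(0.4328, 0.0857))
```

Returning `None` instead of a sentinel such as `0.0` keeps "couldn't measure" distinguishable from "measured and healthy" in both the manifest and the emitted gauges.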
ROADMAP L1816 P1 (promoted from P2 2026-05-15; ship-now-anyway direction 2026-05-19 despite the entry's own ≥30-week-corpus advisory; explicit opt-in at 22 weeks). Bounded-risk fix: the Ridge regularization absorbs today's overfit (`research_calibrator_prob` std-coef only +0.05; meta_IC driven by `sector_macro_modifier` + `research_composite_score`), so this is an integrity tightening, NOT a bleeding-alpha fix.

## Evidence

The 2026-05-09 promote-true retrain at the new 21d horizon showed:

`train_ic=0.4328 / val_ic=0.0857 ⇒ ratio 5.05 on 496 rows`

The threshold cited in PR #113's manifest spec is >2×; 5.05× is decisive overfit. Cause: the 21d horizon shrinks the labeled fit set vs the pre-cutover 5d horizon, and `num_leaves=15 / max_depth=4` was sized for the larger-corpus 5d regime.

## What ships

- `model/research_gbm.py::_default_params`: `num_leaves` 15→8, `max_depth` 4→3.
- `training/meta_trainer.py`: `_compute_overfit_signal(train_ic, val_ic)` helper + `_emit_research_gbm_overfit_metrics(...)` CW dispatcher, both wired into Step 6c after the existing train_ic computation. Warns loudly on ratio > 3.0; the manifest `research_gbm` block now carries `train_val_ic_ratio` + `overfit_warn` alongside the existing `train_ic` + `val_ic`.
- `tests/test_research_gbm_scorer.py`: `test_default_params_overfit_tightening_2026_05_19` pins 8/3 vs a future revert.
- `tests/test_research_gbm_overfit_signal.py` (new, 13 tests).

2-cycle persistence alarm is satisfied by setting up the CW alarm on `research_gbm_overfit_warn > 0` with `DatapointsToAlarm=2 / EvaluationPeriods=2` (weekly cadence); no code change needed once the gauges exist.

## What does NOT ship

- Per-fold WF GBM training: at 22 weeks of labeled history, per-fold rows fall below `min_child_samples=30`; the entry's own caveat at lines 1347-1353 documents this. Re-promoted to the entry's `>=30 weeks (~2026-06-20)` milestone.

## Tests

`pytest tests/ -q` -> 1110 passed, 0 failed

Composes with: PR #113 (manifest train/val IC emit substrate), PR #114 (canonical-alpha cutover that surfaced the overfit by shrinking the fit set), [[feedback_component_baseline_validation]].
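For readers wiring up the alarm, the gauge payload behind `research_gbm_overfit_warn` might look roughly like the sketch below. The helper name `build_overfit_metric_data` and the `Unit` value are assumptions; the PR's actual `_emit_research_gbm_overfit_metrics` is not reproduced here. The metric names and the "no emit when the ratio can't be measured" behavior follow the commit message. The list uses the `MetricData` entry shape accepted by CloudWatch `PutMetricData`.

```python
from typing import Dict, List, Optional, Union

MetricDatum = Dict[str, Union[str, float]]


def build_overfit_metric_data(
    train_ic: Optional[float],
    val_ic: Optional[float],
    ratio: Optional[float],
    warn: bool,
) -> List[MetricDatum]:
    """Build CloudWatch gauge entries, or an empty list when nothing is measurable."""
    # No emit when the signal could not be measured (ratio is None) or an
    # input is absent; skipping avoids publishing a misleading 0.
    if train_ic is None or val_ic is None or ratio is None:
        return []
    gauges = {
        "research_gbm_train_ic": train_ic,
        "research_gbm_val_ic": val_ic,
        "research_gbm_train_val_ic_ratio": ratio,
        # The warn flag is emitted as 0/1 so an alarm can test "> 0".
        "research_gbm_overfit_warn": 1.0 if warn else 0.0,
    }
    return [
        {"MetricName": name, "Value": float(value), "Unit": "None"}
        for name, value in gauges.items()
    ]


if __name__ == "__main__":
    for datum in build_overfit_metric_data(0.4328, 0.0857, 5.05, True):
        print(datum)
```

With boto3, such a list could be passed as `MetricData` to the CloudWatch client's `put_metric_data` call with `Namespace="AlphaEngine/Predictor"`; wrapping that call in a `try`/`except` that logs a warning would match the best-effort behavior the PR describes.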
🤖 Generated with Claude Code