feat(risk-model): weekly F+D S3 persistence from PredictorTraining (C.2b) #202
Merged
….2b)

ROADMAP C.2b: wires the C.2a math primitives (``risk_model.build_factor_risk_model``, PR #200) into the Saturday SF PredictorTraining stage and persists F + D parquets to ``s3://{bucket}/risk_model/{date}/{F,D}.parquet`` weekly, plus metadata.json with build params + shape diagnostics.

This is the production-persistence layer for the structural factor risk model. The actual wiring of Σ = B · F · Bᵀ + D into ``executor.portfolio_optimizer.solve_target_weights`` (workstream C.3) is gated on the B.5 cutover gate per the plan doc HARD SEQUENCING CONSTRAINT: two simultaneous Σ-substrate changes make backtester regressions untraceable. C.2b ships the weekly persistence so that by the time C.3 reads ``risk_model/{date}/``, there are ≥4 weeks of F + D accumulated.

What the stage does (~10s for the build, dominated by the cross-sectional OLS at each date):

1. Reads each ticker parquet from the train_handler's tmp_cache (the same per-ticker files download_from_arctic just populated).
2. Extracts Close → log returns (T × N panel).
3. Extracts the 8 ``*_zscore`` factor-loading columns from C.1 (alpha-engine-data #324). Tickers missing all 8 are skipped at the per-ticker level; dates with any-NaN per-row are skipped at the per-date level.
4. Calls ``build_factor_risk_model``: Fama-MacBeth 1973 cross-sectional OLS → (K_eff × K_eff) F covariance + (N,) D idiosyncratic variance + metadata.
5. Writes F.parquet, D.parquet, metadata.json to ``s3://{bucket}/risk_model/{date}/``.

Graceful degradation:

- ``status=skipped`` when returns_panel < 60 dates (universe cache empty or freshly bootstrapped)
- ``status=skipped`` when fewer than 30 tickers carry all 8 loading columns (pre-2026-05-26 universe cache, before #324 shipped). Stage auto-activates once C.1 loadings have accumulated.
- Malformed parquets skipped silently at read time (one bad ticker can't abort the weekly build).
- S3 persist failure is best-effort + WARN (mirrors the sweep stage pattern). The build itself + the in-process model dict survive.

Wired into ``training/train_handler.py:main()`` as Step 2d2 (after training summary write, before triple-barrier cutover gate). Non-blocking: failure logs a WARN and does NOT abort the training pipeline. Skipped on ``--dry-run`` paths.

Tests (9 new):

- Empty / sparse data_dir → status=skipped with specific reasons
- 0 tickers with loadings → status=skipped with #324-tagged reason
- Happy path persists F.parquet + D.parquet + metadata.json to ``risk_model/{date}/`` prefix
- F is square (load-bearing for the C.3 Σ = B·F·Bᵀ + D matmul)
- ``dry_run=True`` builds the model but skips S3 write
- S3 persist failure → status=ok + WARN (best-effort)
- Malformed parquet silently skipped (graceful read)

Suite: 1201 → 1210 (+9).

Composes with PR #200 (C.2a F+D math primitives), alpha-engine-data #324 (C.1 ``*_zscore`` factor loadings), and ROADMAP C.2b → unblocks C.3 wiring once B.5 cutover passes and ≥4 weeks of F + D are in S3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
ROADMAP C.2b — wires the C.2a math primitives (`risk_model.build_factor_risk_model`, #200) into the Saturday SF PredictorTraining stage and persists F + D parquets to `s3://{bucket}/risk_model/{date}/{F,D}.parquet` weekly. Plus `metadata.json` with build params + shape diagnostics.
This is the production-persistence layer for the structural factor risk model. The actual wiring of Σ = B · F · Bᵀ + D into `executor.portfolio_optimizer.solve_target_weights` (workstream C.3) is gated on the B.5 cutover gate per the plan doc's HARD SEQUENCING CONSTRAINT — two simultaneous Σ-substrate changes make backtester regressions untraceable. C.2b ships the weekly persistence so by the time C.3 reads `risk_model/{date}/`, there are ≥4 weeks of F + D accumulated.
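For reference, the Σ = B · F · Bᵀ + D assembly that C.3 will eventually perform can be sketched in a few lines of NumPy. This is only an illustration, not the C.3 implementation: `assemble_covariance` is a hypothetical helper name, the toy numbers are made up, and it assumes D is stored as a length-N vector of idiosyncratic variances that is placed on the diagonal.

```python
import numpy as np


def assemble_covariance(B: np.ndarray, F: np.ndarray, D: np.ndarray) -> np.ndarray:
    """Return Sigma = B @ F @ B.T + diag(D) for N assets and K factors.

    B: (N, K) factor loadings, F: (K, K) factor covariance,
    D: (N,) idiosyncratic variances (assumed to be stored as a vector).
    """
    n_assets, n_factors = B.shape
    if F.shape != (n_factors, n_factors):
        # F must be square and match the factor dimension of B.
        raise ValueError(f"F has shape {F.shape}, expected {(n_factors, n_factors)}")
    if D.shape != (n_assets,):
        raise ValueError(f"D has shape {D.shape}, expected {(n_assets,)}")
    return B @ F @ B.T + np.diag(D)


# Toy inputs: 3 assets, 2 factors (illustrative values only).
B = np.array([[1.0, 0.5], [0.2, -0.3], [-0.4, 0.8]])
F = np.array([[0.04, 0.01], [0.01, 0.09]])
D = np.array([0.02, 0.03, 0.01])
sigma = assemble_covariance(B, F, D)
```

The shape check on F is the reason the test suite asserts that the persisted F is square: a non-square F would make the matmul fail or, worse, silently misalign factors.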
What the stage does
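~10s for the build, dominated by the cross-sectional OLS at each date:

1. Reads each ticker parquet from the train_handler's tmp_cache (the same per-ticker files `download_from_arctic` just populated).
2. Extracts Close → log returns (T × N panel).
3. Extracts the 8 `*_zscore` factor-loading columns from C.1 (alpha-engine-data #324). Tickers missing all 8 are skipped at the per-ticker level; dates with any-NaN per-row are skipped at the per-date level.
4. Calls `build_factor_risk_model`: Fama-MacBeth 1973 cross-sectional OLS → (K_eff × K_eff) F covariance + (N,) D idiosyncratic variance + metadata.
5. Writes `F.parquet`, `D.parquet`, `metadata.json` to `s3://{bucket}/risk_model/{date}/`.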
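The per-date cross-sectional OLS can be sketched roughly as follows. This is a minimal illustration of the Fama-MacBeth idea, not the code in `risk_model.build_factor_risk_model`: the function names, the array layout (returns as T × N, loadings as T × N × K), the absence of an intercept, and the sample-variance estimators are all assumptions made for the example.

```python
import numpy as np


def log_returns(close: np.ndarray) -> np.ndarray:
    """Convert a (T+1, N) panel of close prices into (T, N) log returns."""
    return np.diff(np.log(close), axis=0)


def fama_macbeth_f_d(returns: np.ndarray, loadings: np.ndarray):
    """Estimate F (K x K) and D (N,) from per-date cross-sectional OLS.

    returns:  (T, N) asset returns
    loadings: (T, N, K) factor loadings observed on each date
    Dates containing any NaN are skipped, mirroring the per-date skip rule.
    """
    factor_returns, residuals = [], []
    for r_t, b_t in zip(returns, loadings):
        if np.isnan(r_t).any() or np.isnan(b_t).any():
            continue  # skip the whole date if any value is missing
        # Cross-sectional regression r_t = b_t @ f_t + e_t (no intercept here).
        f_t, *_ = np.linalg.lstsq(b_t, r_t, rcond=None)
        factor_returns.append(f_t)
        residuals.append(r_t - b_t @ f_t)
    if len(factor_returns) < 2:
        raise ValueError("need at least two usable dates to estimate a covariance")
    fr = np.asarray(factor_returns)   # (T_eff, K) time series of factor returns
    res = np.asarray(residuals)       # (T_eff, N) time series of residuals
    F = np.atleast_2d(np.cov(fr, rowvar=False))  # factor covariance
    D = res.var(axis=0, ddof=1)                  # idiosyncratic variances
    return F, D


# Synthetic panel: 80 dates, 40 tickers, 3 factors (illustrative only).
rng = np.random.default_rng(0)
T, N, K = 80, 40, 3
loadings = rng.standard_normal((T, N, K))
true_f = 0.01 * rng.standard_normal((T, K))
returns = np.einsum("tnk,tk->tn", loadings, true_f) + 0.02 * rng.standard_normal((T, N))
returns[5, 0] = np.nan  # this date is dropped by the any-NaN rule
F, D = fama_macbeth_f_d(returns, loadings)
```

Each date costs one least-squares solve over N tickers and K factors, which is consistent with the build time being dominated by the cross-sectional OLS.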
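Graceful degradation

- `status=skipped` when returns_panel < 60 dates (universe cache empty or freshly bootstrapped).
- `status=skipped` when fewer than 30 tickers carry all 8 loading columns (pre-2026-05-26 universe cache, before #324 shipped). Stage auto-activates once C.1 loadings have accumulated.
- Malformed parquets are skipped silently at read time (one bad ticker can't abort the weekly build).
- S3 persist failure is best-effort + WARN (mirrors the sweep stage pattern). The build itself + the in-process model dict survive.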
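The two skip thresholds from the commit message (fewer than 60 dates in the returns panel, fewer than 30 tickers carrying all 8 loading columns) amount to a small gate in front of the build. The sketch below shows one way such a gate could look; `skip_reason`, `stage_status`, the constant names, and the reason strings are hypothetical and not taken from the PR.

```python
from typing import Optional

# Thresholds quoted in the PR description; the constant names are illustrative.
MIN_DATES = 60
MIN_TICKERS_WITH_LOADINGS = 30


def skip_reason(n_dates: int, n_tickers_with_loadings: int) -> Optional[str]:
    """Return a human-readable skip reason, or None if the build can proceed."""
    if n_dates < MIN_DATES:
        return f"returns panel has {n_dates} dates, need at least {MIN_DATES}"
    if n_tickers_with_loadings < MIN_TICKERS_WITH_LOADINGS:
        return (
            f"only {n_tickers_with_loadings} tickers carry all loading columns, "
            f"need at least {MIN_TICKERS_WITH_LOADINGS} (see alpha-engine-data #324)"
        )
    return None


def stage_status(n_dates: int, n_tickers_with_loadings: int) -> dict:
    """Map the gate result onto a status dict like the one the stage reports."""
    reason = skip_reason(n_dates, n_tickers_with_loadings)
    if reason is not None:
        return {"status": "skipped", "reason": reason}
    return {"status": "ok"}
```

Returning a specific reason rather than a bare flag is what lets the tests assert on "status=skipped with specific reasons".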
Wiring
Wired into `training/train_handler.py:main()` as Step 2d2 (after training summary write, before triple-barrier cutover gate). Non-blocking: failure logs a WARN and does NOT abort the training pipeline. Skipped on `--dry-run` paths.
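The non-blocking behaviour (a persist failure logs a WARN and the stage still reports `status=ok`) can be sketched as below. This is an assumption-laden illustration rather than the code in `train_handler.py`: `risk_model_keys`, `persist_best_effort`, the `persisted` field, and the injected `upload` callable are hypothetical, and the real stage presumably talks to S3 directly.

```python
import logging
from typing import Callable, Dict

logger = logging.getLogger("risk_model_stage")  # hypothetical logger name


def risk_model_keys(date: str) -> Dict[str, str]:
    """Build the object keys under the risk_model/{date}/ prefix."""
    prefix = f"risk_model/{date}/"
    return {name: prefix + name for name in ("F.parquet", "D.parquet", "metadata.json")}


def persist_best_effort(
    artifacts: Dict[str, bytes],
    date: str,
    upload: Callable[[str, bytes], None],
    dry_run: bool = False,
) -> dict:
    """Upload the three artifacts; never raise, so the caller's pipeline continues."""
    if dry_run:
        # The model has been built, but nothing is written on dry-run paths.
        return {"status": "ok", "persisted": False}
    try:
        for name, key in risk_model_keys(date).items():
            upload(key, artifacts[name])
    except Exception as exc:  # best-effort: downgrade any failure to a warning
        logger.warning("risk model persist failed for %s: %s", date, exc)
        return {"status": "ok", "persisted": False}
    return {"status": "ok", "persisted": True}


# Usage with an in-memory stand-in for S3 (the date format is an assumption).
store: Dict[str, bytes] = {}
artifacts = {"F.parquet": b"F", "D.parquet": b"D", "metadata.json": b"{}"}
result = persist_best_effort(artifacts, "2026-05-30", store.__setitem__)
```

Injecting the uploader keeps the sketch testable without network access; a failing uploader exercises the WARN path without aborting the caller.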
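Tests (9 new)

- Empty / sparse data_dir → status=skipped with specific reasons
- 0 tickers with loadings → status=skipped with #324-tagged reason
- Happy path persists `F.parquet` + `D.parquet` + `metadata.json` to `risk_model/{date}/` prefix
- F is square (load-bearing for the C.3 Σ = B·F·Bᵀ + D matmul)
- `dry_run=True` builds the model but skips S3 write
- S3 persist failure → status=ok + WARN (best-effort)
- Malformed parquet silently skipped (graceful read)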
Suite: 1201 → 1210 (+9).
Test plan
Composes with PR #200 (C.2a F+D math primitives), alpha-engine-data #324 (C.1 `*_zscore` factor loadings), and ROADMAP C.2b → unblocks C.3 wiring once B.5 cutover passes and ≥4 weeks of F + D are in S3.
🤖 Generated with Claude Code