feat: DataPhase1 postflight — producer-side consumer-contract checks#52
Merged
Conversation
Encodes the union of downstream consumer contracts (research's PriceFetchError + MacroFetchError, predictor's _verify_arctic_fresh) on the producer side, so DataPhase1 fails BEFORE any downstream Lambda cold-start or spot-EC2 bootstrap that's doomed to fail at preflight. Why --- After Phase 7a/7c, every downstream module runs an identical ArcticDB + S3 freshness gate at its own preflight. If DataPhase1 lands a partial or stale write today, the failure surfaces 3× downstream — research Lambda PreflightError, predictor PipelineAbort, backtester PriceFetchError — each on a different worker, each emitting its own alarm + email. Operator sees 3 incidents that all trace to DataPhase1. This collapses the blast radius to one alarm at the DataPhase1 step with the exact named contract violation, AND avoids ~5min of wasted spot-EC2 bootstrap per downstream worker (predictor training, backtester parity). Contract encoded ---------------- 1. ArcticDB macro.SPY last_date >= run_date - 1 (predictor strictness) 2. ArcticDB universe sample (20 random non-macro tickers) within 2d of SPY's last_date — catches partial writes 3. market_data/<run_date>/macro.json HEAD + parse + fed_funds_rate populated (research MacroFetchError contract) 4. market_data/<run_date>/constituents.json HEAD + parse + tickers >= 800 + sector_map dict (research PriceFetchError contract) 5. market_data/latest_weekly.json pointer date == run_date — catches the #1 silent-failure mode where dated artifacts write but pointer doesn't roll forward Files ----- * validators/postflight.py — DataPostflight class + PostflightError. Subclasses are NOT used here (BasePreflight in alpha-engine-lib has preflight semantics; postflight contract is different enough to warrant a focused class). * weekly_collector.py::_finalize — runs DataPostflight after _write_manifest + _write_validation_json, before _write_health_marker. Only fires for phase=1, not dry_run, only is None, and existing status=='ok'. On PostflightError, flips status to 'postflight_failed' and lets the existing main()'s SystemExit(1) propagate. * tests/test_postflight.py — 18 tests covering each contract's pass / fail mode plus _finalize wiring (status flip + skip when collection itself failed). Full suite 71/71 green. Phase gating ------------ Only Phase 1 today. Phase 2 (DataPhase2 alternative-data Lambda) gets its own postflight when the alt-data contract is encoded — different shape (per-ticker JSON, smaller universe). Daily SSM path eventually gets a daily postflight matching predictor inference's tighter freshness window. Validation ---------- Saturday 2026-04-18 00:00 UTC DataPhase1 run is the live test. Either clean success or a loud PostflightError naming the violated contract. The same Saturday cycle validates Phase 7c + VWAP writer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cipher813
added a commit
that referenced
this pull request
Apr 17, 2026
Complements postflight (PR #52) which validates output contracts AFTER writes. Preflight runs at the START of _run_phase1 before any collector executes, catching the failure class where a credential drift, external- API outage, or misconfiguration would otherwise burn ~55min of spot-EC2 time before failing on the 40th collector call. Five fail-fast checks (cheapest-first ordering, total ~10s wall-time): 1. Required env vars present + non-empty (POLYGON_API_KEY, FRED_API_KEY) 2. S3 bucket HEAD + sentinel PUT/DELETE (validates IAM beyond list) 3. ArcticDB connect + universe/macro libraries present 4. FRED reachable + auth valid (DFF observation call) 5. polygon.io reachable + auth valid (reference-data call) Failure semantics match postflight: PreflightError → results["status"] = "preflight_failed" → SystemExit(1) → SSM → SF HandleFailure → alpha-engine-saturday-sf-failed alarm. Single named failure surface, same blast radius as postflight (one alarm, not three downstream). Skipped on --only single-collector runs (avoids preflight overhead for local debug invocations like `python weekly_collector.py --only macro`). Skipped on --phase 2 (Lambda has different dependency surface; gets its own preflight when needed). Tests: 24 new unit tests covering each check + end-to-end run() ordering + short-circuit on first failure. Full data-module suite: 84 passed. ROADMAP: adds P2 (per-collector value-range outlier guard) and P3 (data contract registry / Pydantic schema-first validation) as the natural follow-ups for pushing validation further upstream/centralizing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
3 tasks
cipher813
added a commit
that referenced
this pull request
Apr 17, 2026
…#55) Complements postflight (PR #52) which validates output contracts AFTER writes. Preflight runs at the START of _run_phase1 before any collector executes, catching the failure class where a credential drift, external- API outage, or misconfiguration would otherwise burn ~55min of spot-EC2 time before failing on the 40th collector call. Five fail-fast checks (cheapest-first ordering, total ~10s wall-time): 1. Required env vars present + non-empty (POLYGON_API_KEY, FRED_API_KEY) 2. S3 bucket HEAD + sentinel PUT/DELETE (validates IAM beyond list) 3. ArcticDB connect + universe/macro libraries present 4. FRED reachable + auth valid (DFF observation call) 5. polygon.io reachable + auth valid (reference-data call) Failure semantics match postflight: PreflightError → results["status"] = "preflight_failed" → SystemExit(1) → SSM → SF HandleFailure → alpha-engine-saturday-sf-failed alarm. Single named failure surface, same blast radius as postflight (one alarm, not three downstream). Skipped on --only single-collector runs (avoids preflight overhead for local debug invocations like `python weekly_collector.py --only macro`). Skipped on --phase 2 (Lambda has different dependency surface; gets its own preflight when needed). Tests: 24 new unit tests covering each check + end-to-end run() ordering + short-circuit on first failure. Full data-module suite: 84 passed. ROADMAP: adds P2 (per-collector value-range outlier guard) and P3 (data contract registry / Pydantic schema-first validation) as the natural follow-ups for pushing validation further upstream/centralizing. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Encodes the union of downstream consumer contracts on the producer side, so DataPhase1 fails BEFORE any downstream Lambda cold-start or spot-EC2 bootstrap that's doomed to fail at preflight.
After Phase 7a/7c, every downstream module runs an identical ArcticDB + S3 freshness gate at its own preflight. If DataPhase1 lands a partial or stale write today, the failure surfaces 3× downstream — research Lambda `PreflightError`, predictor `PipelineAbort`, backtester `PriceFetchError` — each emitting its own alarm + email. This collapses the blast radius to one alarm at the DataPhase1 step with the exact named contract violation.
Contracts encoded
Cost / compute waste avoided
Files
Phase gating
Only Phase 1 today. Phase 2 alt-data and daily SSM paths get their own postflights as future work (different contracts).
Validation
🤖 Generated with Claude Code