Skip to content

feat: DataPhase1 postflight — producer-side consumer-contract checks#52

Merged
cipher813 merged 1 commit into
mainfrom
feat/dataphase1-postflight
Apr 17, 2026
Merged

feat: DataPhase1 postflight — producer-side consumer-contract checks#52
cipher813 merged 1 commit into
mainfrom
feat/dataphase1-postflight

Conversation

@cipher813
Copy link
Copy Markdown
Owner

Summary

Encodes the union of downstream consumer contracts on the producer side, so DataPhase1 fails BEFORE any downstream Lambda cold-start or spot-EC2 bootstrap that's doomed to fail at preflight.

After Phase 7a/7c, every downstream module runs an identical ArcticDB + S3 freshness gate at its own preflight. If DataPhase1 lands a partial or stale write today, the failure surfaces 3× downstream — research Lambda `PreflightError`, predictor `PipelineAbort`, backtester `PriceFetchError` — each emitting its own alarm + email. This collapses the blast radius to one alarm at the DataPhase1 step with the exact named contract violation.

Contracts encoded

  1. ArcticDB `macro.SPY` last_date ≥ `run_date - 1` (predictor strictness)
  2. ArcticDB `universe` sample (20 random non-macro tickers) within 2d of SPY's last_date — catches partial writes
  3. `market_data/<run_date>/macro.json` HEAD + parse + `fed_funds_rate` populated (research `MacroFetchError` contract)
  4. `market_data/<run_date>/constituents.json` HEAD + parse + tickers ≥ 800 + sector_map dict (research `PriceFetchError` contract)
  5. `market_data/latest_weekly.json` pointer date == run_date — catches the feat: data quality gates — parquet validation + email fixes #1 silent-failure mode where dated artifacts write but pointer doesn't roll forward

Cost / compute waste avoided

Layer Wasted today on a DataPhase1 partial write After this PR
Research Lambda 30s container pull + venv init, then PreflightError Skipped
Predictor Training spot ~5min c5.xlarge bootstrap, then PipelineAbort Skipped
Backtester parity spot ~5min c5.large bootstrap, then PriceFetchError Skipped
Step Function transitions 3× HandleFailure + NotifyComplete 1× HandleFailure
Alarm emails 3 (one per layer) 1 (DataPhase1 named)
Postflight cost n/a ~1s of S3 + ArcticDB reads

Files

  • `validators/postflight.py` — `DataPostflight` class + `PostflightError`. Standalone (not subclassing BasePreflight; postflight semantics differ enough).
  • `weekly_collector.py::_finalize` — runs postflight after `_write_manifest` + `_write_validation_json`, before `_write_health_marker`. Gated to phase=1, not dry_run, only is None, status=='ok'. On `PostflightError`, flips status to `postflight_failed` and lets existing `main()`'s `SystemExit(1)` propagate.
  • `tests/test_postflight.py` — 18 tests covering each contract pass/fail + `_finalize` wiring. Full suite 71/71 green.

Phase gating

Only Phase 1 today. Phase 2 alt-data and daily SSM paths get their own postflights as future work (different contracts).

Validation

  • 71/71 tests green locally.
  • Saturday 2026-04-18 00:00 UTC DataPhase1 run is the live validation. Either clean success or a loud `PostflightError` naming the violated contract. Same Saturday cycle validates Phase 7c (research) + VWAP writer.

🤖 Generated with Claude Code

Encodes the union of downstream consumer contracts (research's
PriceFetchError + MacroFetchError, predictor's _verify_arctic_fresh) on
the producer side, so DataPhase1 fails BEFORE any downstream Lambda
cold-start or spot-EC2 bootstrap that's doomed to fail at preflight.

Why
---
After Phase 7a/7c, every downstream module runs an identical ArcticDB +
S3 freshness gate at its own preflight. If DataPhase1 lands a partial
or stale write today, the failure surfaces 3× downstream — research
Lambda PreflightError, predictor PipelineAbort, backtester PriceFetchError
— each on a different worker, each emitting its own alarm + email.
Operator sees 3 incidents that all trace to DataPhase1.

This collapses the blast radius to one alarm at the DataPhase1 step
with the exact named contract violation, AND avoids ~5min of wasted
spot-EC2 bootstrap per downstream worker (predictor training, backtester
parity).

Contract encoded
----------------
1. ArcticDB macro.SPY last_date >= run_date - 1 (predictor strictness)
2. ArcticDB universe sample (20 random non-macro tickers) within 2d
   of SPY's last_date — catches partial writes
3. market_data/<run_date>/macro.json HEAD + parse + fed_funds_rate
   populated (research MacroFetchError contract)
4. market_data/<run_date>/constituents.json HEAD + parse + tickers >= 800
   + sector_map dict (research PriceFetchError contract)
5. market_data/latest_weekly.json pointer date == run_date — catches
   the #1 silent-failure mode where dated artifacts write but pointer
   doesn't roll forward

Files
-----
* validators/postflight.py — DataPostflight class + PostflightError.
  Subclasses are NOT used here (BasePreflight in alpha-engine-lib has
  preflight semantics; postflight contract is different enough to
  warrant a focused class).
* weekly_collector.py::_finalize — runs DataPostflight after
  _write_manifest + _write_validation_json, before _write_health_marker.
  Only fires for phase=1, not dry_run, only is None, and existing
  status=='ok'. On PostflightError, flips status to 'postflight_failed'
  and lets the existing main()'s SystemExit(1) propagate.
* tests/test_postflight.py — 18 tests covering each contract's pass /
  fail mode plus _finalize wiring (status flip + skip when collection
  itself failed). Full suite 71/71 green.

Phase gating
------------
Only Phase 1 today. Phase 2 (DataPhase2 alternative-data Lambda) gets
its own postflight when the alt-data contract is encoded — different
shape (per-ticker JSON, smaller universe). Daily SSM path eventually
gets a daily postflight matching predictor inference's tighter
freshness window.

Validation
----------
Saturday 2026-04-18 00:00 UTC DataPhase1 run is the live test.
Either clean success or a loud PostflightError naming the violated
contract. The same Saturday cycle validates Phase 7c + VWAP writer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cipher813 cipher813 merged commit d071ed1 into main Apr 17, 2026
1 check passed
@cipher813 cipher813 deleted the feat/dataphase1-postflight branch April 17, 2026 14:30
cipher813 added a commit that referenced this pull request Apr 17, 2026
Complements postflight (PR #52) which validates output contracts AFTER
writes. Preflight runs at the START of _run_phase1 before any collector
executes, catching the failure class where a credential drift, external-
API outage, or misconfiguration would otherwise burn ~55min of spot-EC2
time before failing on the 40th collector call.

Five fail-fast checks (cheapest-first ordering, total ~10s wall-time):

  1. Required env vars present + non-empty (POLYGON_API_KEY, FRED_API_KEY)
  2. S3 bucket HEAD + sentinel PUT/DELETE (validates IAM beyond list)
  3. ArcticDB connect + universe/macro libraries present
  4. FRED reachable + auth valid (DFF observation call)
  5. polygon.io reachable + auth valid (reference-data call)

Failure semantics match postflight: PreflightError → results["status"]
= "preflight_failed" → SystemExit(1) → SSM → SF HandleFailure →
alpha-engine-saturday-sf-failed alarm. Single named failure surface,
same blast radius as postflight (one alarm, not three downstream).

Skipped on --only single-collector runs (avoids preflight overhead for
local debug invocations like `python weekly_collector.py --only macro`).
Skipped on --phase 2 (Lambda has different dependency surface; gets its
own preflight when needed).

Tests: 24 new unit tests covering each check + end-to-end run() ordering
+ short-circuit on first failure. Full data-module suite: 84 passed.

ROADMAP: adds P2 (per-collector value-range outlier guard) and P3 (data
contract registry / Pydantic schema-first validation) as the natural
follow-ups for pushing validation further upstream/centralizing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cipher813 added a commit that referenced this pull request Apr 17, 2026
…#55)

Complements postflight (PR #52) which validates output contracts AFTER
writes. Preflight runs at the START of _run_phase1 before any collector
executes, catching the failure class where a credential drift, external-
API outage, or misconfiguration would otherwise burn ~55min of spot-EC2
time before failing on the 40th collector call.

Five fail-fast checks (cheapest-first ordering, total ~10s wall-time):

  1. Required env vars present + non-empty (POLYGON_API_KEY, FRED_API_KEY)
  2. S3 bucket HEAD + sentinel PUT/DELETE (validates IAM beyond list)
  3. ArcticDB connect + universe/macro libraries present
  4. FRED reachable + auth valid (DFF observation call)
  5. polygon.io reachable + auth valid (reference-data call)

Failure semantics match postflight: PreflightError → results["status"]
= "preflight_failed" → SystemExit(1) → SSM → SF HandleFailure →
alpha-engine-saturday-sf-failed alarm. Single named failure surface,
same blast radius as postflight (one alarm, not three downstream).

Skipped on --only single-collector runs (avoids preflight overhead for
local debug invocations like `python weekly_collector.py --only macro`).
Skipped on --phase 2 (Lambda has different dependency surface; gets its
own preflight when needed).

Tests: 24 new unit tests covering each check + end-to-end run() ordering
+ short-circuit on first failure. Full data-module suite: 84 passed.

ROADMAP: adds P2 (per-collector value-range outlier guard) and P3 (data
contract registry / Pydantic schema-first validation) as the natural
follow-ups for pushing validation further upstream/centralizing.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant