feat: data quality gates — parquet validation + email fixes#1
Merged
Conversation
…l fixes Add validation of slim cache parquet slices after slicing, aggregate validation results into a per-run validation.json on S3, fix emailer detail extraction to match actual collector field names, and add set -o pipefail to the DriftDetection step. Includes 14 new tests covering validation checks and email detail extraction. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cipher813
added a commit
that referenced
this pull request
Apr 15, 2026
…nals
Two related bugs kept scanner/team/cio grading pinned at N/A even when
the merge on (ticker, eval_date) would otherwise have worked.
1. NULL return_5d rows got stuck forever. When a collector run inserted
a date whose 5d forward window hadn't closed yet, the row landed with
NULL return/spy/beat columns. The next run's `existing` set contained
that eval_date, so `dates_to_process` skipped it — the returns were
never backfilled. Combined with `INSERT OR IGNORE` this silently
orphaned Fri 2026-04-03 and other edge cases.
Fix: `_get_existing_dates` now returns only eval_dates where
`return_5d IS NOT NULL`; `_insert_rows` uses INSERT OR REPLACE so a
reprocessed row overwrites the stale NULL one.
2. Weekend signal folders have no market data. Research runs before
alpha-engine-research 9a94e34 (2026-04-13, "Stamp signals with
next_trading_day") wrote signals/{Sat,Sun}/. Polygon has no
grouped-daily data for those dates, so rows landed with NULL returns
and then hit bug #1. Post-fix research stamps trading days, but
the legacy weekend folders still exist in S3 and would keep
retriggering the empty-prices path.
Fix: filter eval_dates to trading days (Mon-Fri) before enqueuing
for processing.
Existing tests pass (46/46). The next Saturday Step Function run will
reprocess the stuck Fri 2026-04-03 row and populate return_5d for any
subsequent trading days whose 5d window has since closed.
Out of scope: market-holiday filtering (e.g. Good Friday). _is_trading_day
only screens weekends; a closed-market weekday will still be attempted —
polygon returns empty and the row is skipped at _build_rows_for_date,
which is acceptable for now. Proper NYSE calendar filtering belongs in
a follow-up.
3 tasks
cipher813
added a commit
that referenced
this pull request
Apr 15, 2026
…nals (#38) Two related bugs kept scanner/team/cio grading pinned at N/A even when the merge on (ticker, eval_date) would otherwise have worked. 1. NULL return_5d rows got stuck forever. When a collector run inserted a date whose 5d forward window hadn't closed yet, the row landed with NULL return/spy/beat columns. The next run's `existing` set contained that eval_date, so `dates_to_process` skipped it — the returns were never backfilled. Combined with `INSERT OR IGNORE` this silently orphaned Fri 2026-04-03 and other edge cases. Fix: `_get_existing_dates` now returns only eval_dates where `return_5d IS NOT NULL`; `_insert_rows` uses INSERT OR REPLACE so a reprocessed row overwrites the stale NULL one. 2. Weekend signal folders have no market data. Research runs before alpha-engine-research 9a94e34 (2026-04-13, "Stamp signals with next_trading_day") wrote signals/{Sat,Sun}/. Polygon has no grouped-daily data for those dates, so rows landed with NULL returns and then hit bug #1. Post-fix research stamps trading days, but the legacy weekend folders still exist in S3 and would keep retriggering the empty-prices path. Fix: filter eval_dates to trading days (Mon-Fri) before enqueuing for processing. Existing tests pass (46/46). The next Saturday Step Function run will reprocess the stuck Fri 2026-04-03 row and populate return_5d for any subsequent trading days whose 5d window has since closed. Out of scope: market-holiday filtering (e.g. Good Friday). _is_trading_day only screens weekends; a closed-market weekday will still be attempted — polygon returns empty and the row is skipped at _build_rows_for_date, which is acceptable for now. Proper NYSE calendar filtering belongs in a follow-up.
cipher813
added a commit
that referenced
this pull request
Apr 17, 2026
Encodes the union of downstream consumer contracts (research's PriceFetchError + MacroFetchError, predictor's _verify_arctic_fresh) on the producer side, so DataPhase1 fails BEFORE any downstream Lambda cold-start or spot-EC2 bootstrap that's doomed to fail at preflight. Why --- After Phase 7a/7c, every downstream module runs an identical ArcticDB + S3 freshness gate at its own preflight. If DataPhase1 lands a partial or stale write today, the failure surfaces 3× downstream — research Lambda PreflightError, predictor PipelineAbort, backtester PriceFetchError — each on a different worker, each emitting its own alarm + email. Operator sees 3 incidents that all trace to DataPhase1. This collapses the blast radius to one alarm at the DataPhase1 step with the exact named contract violation, AND avoids ~5min of wasted spot-EC2 bootstrap per downstream worker (predictor training, backtester parity). Contract encoded ---------------- 1. ArcticDB macro.SPY last_date >= run_date - 1 (predictor strictness) 2. ArcticDB universe sample (20 random non-macro tickers) within 2d of SPY's last_date — catches partial writes 3. market_data/<run_date>/macro.json HEAD + parse + fed_funds_rate populated (research MacroFetchError contract) 4. market_data/<run_date>/constituents.json HEAD + parse + tickers >= 800 + sector_map dict (research PriceFetchError contract) 5. market_data/latest_weekly.json pointer date == run_date — catches the #1 silent-failure mode where dated artifacts write but pointer doesn't roll forward Files ----- * validators/postflight.py — DataPostflight class + PostflightError. Subclasses are NOT used here (BasePreflight in alpha-engine-lib has preflight semantics; postflight contract is different enough to warrant a focused class). * weekly_collector.py::_finalize — runs DataPostflight after _write_manifest + _write_validation_json, before _write_health_marker. Only fires for phase=1, not dry_run, only is None, and existing status=='ok'. On PostflightError, flips status to 'postflight_failed' and lets the existing main()'s SystemExit(1) propagate. * tests/test_postflight.py — 18 tests covering each contract's pass / fail mode plus _finalize wiring (status flip + skip when collection itself failed). Full suite 71/71 green. Phase gating ------------ Only Phase 1 today. Phase 2 (DataPhase2 alternative-data Lambda) gets its own postflight when the alt-data contract is encoded — different shape (per-ticker JSON, smaller universe). Daily SSM path eventually gets a daily postflight matching predictor inference's tighter freshness window. Validation ---------- Saturday 2026-04-18 00:00 UTC DataPhase1 run is the live test. Either clean success or a loud PostflightError naming the violated contract. The same Saturday cycle validates Phase 7c + VWAP writer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2 tasks
cipher813
added a commit
that referenced
this pull request
Apr 17, 2026
…52) Encodes the union of downstream consumer contracts (research's PriceFetchError + MacroFetchError, predictor's _verify_arctic_fresh) on the producer side, so DataPhase1 fails BEFORE any downstream Lambda cold-start or spot-EC2 bootstrap that's doomed to fail at preflight. Why --- After Phase 7a/7c, every downstream module runs an identical ArcticDB + S3 freshness gate at its own preflight. If DataPhase1 lands a partial or stale write today, the failure surfaces 3× downstream — research Lambda PreflightError, predictor PipelineAbort, backtester PriceFetchError — each on a different worker, each emitting its own alarm + email. Operator sees 3 incidents that all trace to DataPhase1. This collapses the blast radius to one alarm at the DataPhase1 step with the exact named contract violation, AND avoids ~5min of wasted spot-EC2 bootstrap per downstream worker (predictor training, backtester parity). Contract encoded ---------------- 1. ArcticDB macro.SPY last_date >= run_date - 1 (predictor strictness) 2. ArcticDB universe sample (20 random non-macro tickers) within 2d of SPY's last_date — catches partial writes 3. market_data/<run_date>/macro.json HEAD + parse + fed_funds_rate populated (research MacroFetchError contract) 4. market_data/<run_date>/constituents.json HEAD + parse + tickers >= 800 + sector_map dict (research PriceFetchError contract) 5. market_data/latest_weekly.json pointer date == run_date — catches the #1 silent-failure mode where dated artifacts write but pointer doesn't roll forward Files ----- * validators/postflight.py — DataPostflight class + PostflightError. Subclasses are NOT used here (BasePreflight in alpha-engine-lib has preflight semantics; postflight contract is different enough to warrant a focused class). * weekly_collector.py::_finalize — runs DataPostflight after _write_manifest + _write_validation_json, before _write_health_marker. Only fires for phase=1, not dry_run, only is None, and existing status=='ok'. On PostflightError, flips status to 'postflight_failed' and lets the existing main()'s SystemExit(1) propagate. * tests/test_postflight.py — 18 tests covering each contract's pass / fail mode plus _finalize wiring (status flip + skip when collection itself failed). Full suite 71/71 green. Phase gating ------------ Only Phase 1 today. Phase 2 (DataPhase2 alternative-data Lambda) gets its own postflight when the alt-data contract is encoded — different shape (per-ticker JSON, smaller universe). Daily SSM path eventually gets a daily postflight matching predictor inference's tighter freshness window. Validation ---------- Saturday 2026-04-18 00:00 UTC DataPhase1 run is the live test. Either clean success or a loud PostflightError naming the violated contract. The same Saturday cycle validates Phase 7c + VWAP writer. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
5 tasks
cipher813
added a commit
that referenced
this pull request
May 14, 2026
…240) 2026-05-14 EOD recovery surfaced a real design gap in the EOD SF. When HandleFailure's SNS publish fails for any reason — malformed $.sns_topic_arn (today's recovery payload had a colon → space substitution between us-east-1 and the account ID), SNS throttling, IAM drift, transient outage — the entire SF aborted before reaching ForceStopInstance, leaving the trading EC2 running until manual stop. The state's own comment ("Failure alert via SNS — instance still stops to avoid cost") was unenforced. Two-part defensive fix: 1. Hardcode the SNS topic ARN (no $.sns_topic_arn indirection). The ARN is fixed; per-execution variability buys nothing and creates a corruption surface. Today's recovery-input space-instead-of-colon would have been impossible. 2. Catch States.ALL on HandleFailure → ForceStopInstance so the cost-guard fires regardless of alert delivery (defense-in-depth even with #1 in place — covers SNS outages, IAM drift, future failure modes). Live verification: deployed via update_eod_pipeline_sf.sh; describe- state-machine confirms `TopicArn` is literal + Catch routes to ForceStopInstance. Tests: full alpha-engine-data suite 1035 passed (was 1032; +3 wiring pins in test_sf_eod_substrate_check_wiring.py — TopicArn-is-literal, HandleFailure-has-States.ALL-catch-to-ForceStopInstance, no-state- binds-$.sns_topic_arn). Composes with PR #238 (today's daily_append column-order hotfix). Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cipher813
added a commit
that referenced
this pull request
May 18, 2026
…y path) (#259) ROADMAP "Friday shell-run — per-module dry-path activation" owed-item #1. Under the Friday shell_run, the DataPhase1/MorningEnrich + RAGIngestion spot states now boot the spot for real, run their EXISTING preflight, then exit 0 with ZERO external API data fetch and ZERO S3/ArcticDB/config/email/SNS writes — catching bootstrap-class breakage (lib-pin drift, sys.path collision, stale ArcticDB symbol, SSM timeout, Dockerfile/image gap) ~12h before the real Saturday run. Reuses the existing preflight substrate; no parallel preflight written. Where the gate sits / zero-fetch zero-write proof: - weekly_collector.py: new `--preflight-only` argparse flag. main() exits HERE — `raise SystemExit(0)` immediately after the existing `DataPreflight(config["bucket"], mode).run()` and strictly BEFORE `run_weekly(config, args)`. run_weekly() is the SOLE function in the module that performs ANY collector fetch (polygon/FMP/FRED/yfinance) or ANY S3/ArcticDB/parquet/config/module-health write — gating in front of it makes every fetch/write code path statically unreachable. The preflight itself only does read-only/auth probes (S3 HEAD, polygon/FRED reference-data auth calls that fetch no collector data, ArcticDB list_libraries) plus a self-cleaning S3 PUT+DELETE sentinel under preflight/ (the preflight's own liveness probe, not a data write). Ordering pinned by an AST-source test. - rag/pipelines/run_weekly_ingestion.sh: new `--preflight-only` flag. Exits 0 after Step 0 (`python -m rag.preflight`: check_env_vars + check_s3_bucket HEAD — read-only, zero fetch, zero write) and strictly BEFORE Step 1 (ingest_sec_filings). Every ingest_* pipeline, Voyage embedding call, and Postgres/pgvector + parquet write lives in Steps 1-9 — all unreachable once the guard exits. - infrastructure/spot_data_weekly.sh: new `--preflight-only` flag sets PREFLIGHT_ONLY=1, a MODIFIER orthogonal to RUN_MODE so it composes with the data path AND --rag-only. A dedicated data-path block runs `weekly_collector.py --morning-enrich --preflight-only` and/or `weekly_collector.py --phase 1 --preflight-only` (gated by the existing DO_MORNING_ENRICH/DO_PHASE1 split) then exit 0 before the real WORKLOADS heredoc — no prune (prune-audit JSON write), no RAG, no CloudWatch heartbeat, no S3 log upload. --rag-only --preflight-only behavior: runs ONLY the RAG-path preflight (boot + SSM secret fetch so rag.preflight's check_env_vars sees them + `run_weekly_ingestion.sh --preflight-only` = step-0-only + exit 0). No real RAG ingestion, no rag-ingestion heartbeat. `--preflight-only` alone runs ONLY the DataPhase1/MorningEnrich preflight. Universe-freshness tolerance note (ROADMAP owed-item #5): the Friday shell-run uses the phase1 / morning_enrich preflight modes. Per preflight.py::DataPreflight.run, NEITHER mode runs check_arcticdb_fresh — they only do _check_arcticdb_libraries_present (a presence read, not a freshness gate). morning_enrich deliberately omits freshness (it is part of what *makes* ArcticDB fresh); phase1 *populates* ArcticDB. The only freshness gate (check_arcticdb_fresh macro/SPY 4d) lives in the "daily" mode, which the Saturday/Friday data path never selects. So a Friday run predating Friday's settled polygon aggregate does NOT spuriously fail on a Thursday-last-bar — no --preflight-only-scoped tolerance code is required for the data path. Documented inline so a future mode-mapping change re-audits this invariant. Tests: new tests/test_preflight_only_dry_path.py (10 tests, static greps + AST-source assertions, matching the existing test_spot_data_weekly_run_modes.py / test_weekly_collector_preflight_ mode_mapping.py convention) pins: flag parsing on all 3 files, the exit-0-after-preflight-before-fetch/write ordering invariant, --rag-only --preflight-only step-0-only behavior, and the no-prune/no-RAG/no-heartbeat/no-S3-upload hard invariant. Full suite: 1229 passed, 1 skipped (pre-existing). bash -n clean on both shell scripts. No new deps, no secrets. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
3 tasks
cipher813
added a commit
that referenced
this pull request
May 25, 2026
…Phase 4 #1) (#309) Mirrors alpha-engine-research's ``llm_cost_tracker.RunBudgetExceededError`` pattern at the news-pipeline cost site (Phase 0.2 wiring, PR #308): ``S3CostBuffer.record()`` now tracks cumulative cost across the run and raises ``CostBudgetExceededError`` after recording the offending row if the per-run total exceeds ``ALPHA_ENGINE_RUN_BUDGET_USD`` (default $100, shared env var with research + executor — one operator knob ceilings cost across all SF entry points). **Failure shape:** row is recorded BEFORE the raise so per-call detail is preserved on S3 when the breaker fires. The pipeline-side try/finally in ``run_news_pipeline.py`` ensures the buffer flushes even when the breaker raises mid-loop — rows up to and including the breach call land on S3 so operators can diagnose what broke the budget without re-running. **Posture:** breaker propagates through the client proxy (``_CostTrackingMessages.create``) — generic record errors still get swallowed (event extraction's primary deliverable must survive a malformed-response hiccup), but ``CostBudgetExceededError`` is explicitly re-raised since swallowing it would defeat the safety net. **Operator-facing fields on the error:** ``run_id``, ``agent_id``, ``cumulative_cost_usd``, ``ceiling_usd``. Message tells operator how to adjust the env var if the ceiling needs raising. **Tests:** 4 ``TestRunBudgetCeilingResolution`` (default / env / zero disables / malformed-returns-zero) + 5 ``TestCostBudgetBreaker`` (under-ceiling no raise, breach raises after recording row, zero disables, proxy propagates, ceiling defaults from env). Suite 1484 → 1493 passing, zero regressions. **Composes with** PR #308 (the Phase 0.2 wiring) — the breaker is a small additive surface on the buffer; the production wiring path unchanged. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
4 tasks
cipher813
added a commit
that referenced
this pull request
May 25, 2026
…or (#310) Applies the standing rule per ``[[preference_llm_calls_confined_to_research_module]]`` — LLM calls live in alpha-engine-research. The news pipeline's Haiku-backed event extractor is removed and replaced with a deterministic classifier that uses two zero-cost signals already on the wire: 1. **Vendor tags** (``NewsArticle.tags``). Polygon emits keywords, GDELT emits structured event codes, Benzinga emits Channels. The ``alpha_engine_lib.sources.protocols.NewsArticle.tags`` docstring explicitly names this as "a soft signal for downstream event-flag extraction" — we were paying Haiku to re-derive what Polygon / GDELT already tagged. 2. **Title-keyword regex**. Backstop for sources that don't populate tags (Yahoo RSS). 17 pattern → category mappings against the ``DEFAULT_EVENT_CATEGORIES`` closed taxonomy. **Why this is the right answer, not a kill-switch:** code audit found the Haiku per-article structured output was aggregated to 5 scalar / list columns (``event_count``, ``event_severity_max/mean``, ``event_categories``, ``top_event_descriptions``) before any research consumer touched it. The "zero-shot novel-event detection" capability was mostly wasted — research only sees per-ticker rollups. Tag-based + keyword-based classification produces equivalent rollups deterministically. **Cost impact:** retires the largest previously-untracked LLM cost slice in the system per the original Phase 0 audit estimate ($20–60/mo). Actual spend on the deleted call site goes to $0; the research consumer sees identical EventFlag shape (extractor slug changes from ``"anthropic_haiku"`` to ``"rule_based"``) and identical aggregate columns in ``news_aggregates/{date}.parquet``. **Substrate cleanup:** retires three files added earlier this session: - ``collectors/nlp/event_extraction.py`` (the Anthropic extractor itself) - ``rag/pipelines/_cost_telemetry.py`` (Phase 0.2 cost-telemetry buffer, PR #308 + Phase 4 #1 runaway-cost breaker, PR #309 — both retired with the LLM call site they instrumented) - ``tests/test_news_cost_telemetry.py`` (mirrored tests) ``DEFAULT_EVENT_CATEGORIES`` moves into the new ``collectors/nlp/rule_based_event_extraction.py`` so the closed taxonomy stays accessible to downstream consumers. **Protocol contract:** ``EventExtractor.extract`` gains an optional ``article_tags: tuple[str, ...] = ()`` kwarg (back-compat default). The pipeline plumbs the tag union across article variants. Any future EventExtractor implementation (FinBERT, spaCy, reactivated LLM via research module) consumes the same shape. **Severity convention:** rule-based flags emit ``severity=0.5`` uniformly (the EventFlag protocol's documented default). The Haiku severity was a free-floating judgment never tuned by any operator alert. Per-category severity tuning can be added via YAML if a downstream surface needs it. **Tests:** ``TestRuleBasedEventExtractor`` (10 tests) covers empty-text short-circuit, no-match returns empty, title-keyword classification per category (earnings / M&A / FDA), tag-based classification (Polygon/GDELT shape), tag+title union, multi-category emission, deterministic ordering per ``DEFAULT_EVENT_CATEGORIES``, zero-LLM-dependency contract, title-as-description shape. Suite 1493 → 1479 net (retired the 9 cost-telemetry tests + 7 Anthropic extractor tests; added 10 rule-based tests). **Composes with:** - ``[[preference_llm_calls_confined_to_research_module]]`` — the rule this PR enforces - alpha-engine #212 (executor EOD narrative kill switch) — sibling application of the same rule. Two non-research LLM call sites; this PR retires data's entirely, executor's keeps the kill switch substrate (default off) since the LLM path may be operator-reactivated. - Retires the substrate from data #308 + data #309 (Phase 0.2 + Phase 4 #1 cost-telemetry buffer + breaker) — both became dead code with the LLM call site they instrumented. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cipher813
added a commit
that referenced
this pull request
Jun 2, 2026
…ckoff/weekday-pull/restatement-severity (L4482/L4480/L4483/L4486) (#363) Follow-ups to the 2026-06-01 FRED-429 / polygon-timeout incident (#354 made the system resilient to a missed value; these close the surrounding gaps the incident exposed). L4482 — macro backstop survives a transient polygon-fetch exception. collect() Step 1 now catches the TRANSIENT network class (requests Timeout/ConnectionError + PolygonRateLimitError) in polygon_only mode, logs loudly, and falls through to FRED (Step 2) + the macro yfinance backstop (Step 3) — the exact gap that failed recovery re-run #1 (a polygon read-timeout aborted collect() before the FRED-index macro keys could fill). Narrow by design: PolygonForbiddenError (403) and the deliberate "0 tickers" empty-data RuntimeError still propagate with their own messages; a real equity outage still hard-fails at the coverage gate (0 equity records -> < 95%), so the catch cannot mask equity data loss. L4480 — FRED backoff + jitter. New _fred_get_with_retry: bounded exponential backoff + full jitter on the transient class (429 / 5xx / timeout / connection), honors a server Retry-After when present, and raises immediately on a deterministic 4xx (no point retrying a bad series_id). Stops the 429 storm at the source rather than only tolerating a missed value. Mirrors the in-repo polygon_client retry idiom; no new dep. L4483 — weekday SF MorningEnrich now git-pulls. step_function_daily.json MorningEnrich adds the same `git -C ... pull --ff-only origin main` for alpha-engine-data + alpha-engine-config that the Saturday SF already runs, so a same-day recovery re-run on a still-running instance no longer executes stale code (cost a manual SSM pull deploying #354 this incident). L4486 — discrepancy ERROR severity scoped. _log_close_discrepancies now emits FRED-index restatements toward the authoritative value at WARN (`fred_restatement`, excluded from the flow-doctor ERROR filter) while keeping ERROR for genuine cross-source EQUITY drift. The reconciliation predictably restates VIX/TNX >5% when healing a 429-clobbered or stale T-1 edge cell (5/14 VIX, 6/2) — desirable self-heals, not anomalies. The recording surface stays (per feedback_no_silent_fails) at the right level. Tests: new tests/test_daily_closes_coalesce_hardening.py (8) for the retry class, transient-non-fatal control flow + 403-still-propagates, and restatement-vs-equity severity. Fixed two affected fixtures to set status_code (the helper now inspects it). Full suite: 1785 passed. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
validate_parquet(), returning validation summary in result dict_write_validation_json()aggregates validation results from all collectors and writesmarket_data/weekly/{date}/validation.jsonto S3_extract_details()to handle actual field names from prices (refreshed,stale,failed,total) and slim_cache (written)set -o pipefailto DriftDetection stepTest plan
pytest tests/test_price_validator.py tests/test_emailer.py— 14 passpytest tests/— all 36 pass🤖 Generated with Claude Code