Convert silent fails in daily production path to hard fails by cipher813 · Pull Request #25 · cipher813/alpha-engine-data

cipher813 · 2026-04-14T15:03:49Z

Summary

Follow-up to #24. Audits the rest of the alpha-engine-data daily production path for the `except Exception: log.debug / pass` pattern that masked ArcticDB staleness for two days. Scope limited to files on the DailyData + feature-store write path; RAG, emailer, and fundamentals scan deferred.

Changes

`features/compute.py`

`_load_daily_closes_delta`: NoSuchKey → WARNING; every other S3 exception raises; raise if all dates in the business-day range are missing.
Per-ticker feature compute failure: `log.debug` → `log.warning`; RuntimeError if `n_err / len(universe_tickers) > 5%` (matches daily_append).
Empty `store_rows` raises instead of `return {status: error}`.
`_load_cached_alternative` outer except: `log.debug` → `log.warning`.
Schema version write: upgraded to WARNING, kept non-raising (drift metadata, not features).

`collectors/daily_closes.py`

head_object idempotency check: bare except → only `ClientError` with 404/NoSuchKey. Auth / throttling now raises.
Per-ticker yfinance extract: `log.debug` → `log.warning`.

`collectors/macro.py`

`load_from_s3`: pointer-missing returns None; other errors raise.

`weekly_collector.py`

S3 constituents load + Wikipedia fallback: bare `except pass` → WARNING + ERROR with context (fallback chain preserved).

Dead code flagged (defer to cleanup PR)

`features/reader.py` — `read_feature_snapshot / read_feature_range / latest_available_date / read_registry`. Zero callers in alpha-engine-data. Cross-repo consumers (predictor, backtester, research) will migrate away as ArcticDB cutover lands. Safe to delete once migration is confirmed.

Out of scope (tracked for follow-up audit)

`rag/*` (SEC / 8-K / earnings / theses ingestion)
`emailer.py` `except Exception: pass` in finalize email
`features/compute.py` fundamentals/alternative per-key fallback chains
`collectors/fundamentals.py` per-ticker fetch `log.debug`

Test plan

`pytest tests/ --ignore=tests/integration -q` — 41 passed
Saturday weekly Step Function run (first exercise of hardened feature compute path on full universe)
Next weekday DailyData run (exercises hardened `_load_daily_closes_delta`)

🤖 Generated with Claude Code

Follow-up to PR #24 (daily_append). Audits the rest of the alpha-engine-data daily production path for the same "except Exception: log.debug / pass" pattern that masked ArcticDB staleness for two days. Scope limited to files on the DailyData + feature-store write path; RAG ingestion, emailer, and fundamentals scan deferred to a later sweep. features/compute.py - `_load_daily_closes_delta`: per-date NoSuchKey upgraded to WARNING; every other S3 exception raises; raise if ALL dates in the business-day range were missing (the fingerprint of an upstream daily_closes outage). - Per-ticker feature computation failure: log.debug → log.warning; new RuntimeError if `n_err / len(universe_tickers) > 5%`, matching daily_append. - Empty store_rows now raises instead of returning `status=error` (status return is legacy; raising is consistent with hard-fail). - `_load_cached_alternative` outer except: log.debug → log.warning so auth / network failures surface even when "no alt data" is expected. - Schema version write: log.debug → log.warning with a comment explaining why this one stays non-raising (drift-check metadata, not feature data). collectors/daily_closes.py - head_object idempotency check: bare `except Exception: pass` → catch only ClientError with 404/NoSuchKey. Auth / throttling now raises instead of silently proceeding. - Per-ticker yfinance extract: log.debug → log.warning so partial yfinance failures are visible in the daily log. collectors/macro.py - `load_from_s3`: pointer-missing still returns None (expected), but every other error now raises instead of masquerading as "no data." weekly_collector.py - S3 constituents load fallback: bare `except Exception: pass` → warning with context. The Wikipedia fallback remains (legitimate failover). - Wikipedia constituents failure: bare `except Exception: pass` → ERROR log. The downstream `if not tickers` guard already hard-fails. Out of scope (tracked for follow-up audit) - rag/* (SEC / 8-K / earnings / theses ingestion) - emailer.py `except Exception: pass` in finalize email - features/compute.py fundamentals/alternative per-key fallback chains - collectors/fundamentals.py per-ticker fetch log.debug Dead code flagged for later removal - features/reader.py — read_feature_snapshot / read_feature_range / latest_available_date / read_registry. No callers remain inside alpha-engine-data. Consumers in sibling repos (predictor / backtester / research) will migrate away as the ArcticDB cutover completes; safe to delete once the cross-repo migration is confirmed clean. Tests: 41/41 pass. No new tests added — the hard-fail paths are essentially `if cond: raise` and are better exercised by the existing integration test suite against a live S3 bucket (follow-up). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Hardened failures from PRs #24 + #25 now raise cleanly up through _run_daily and weekly_collector.main(). The pipeline exits non-zero, but there's no alerting — a 6 AM failure is only visible if someone manually reads CloudWatch / the SSM log. Flow-doctor closes that loop: any ERROR-level log (including traceback-emitting exceptions from logger.exception) fires email + a deduped GitHub issue. Mirrors the executor's integration (alpha-engine/executor/log_config.py, alpha-engine/flow-doctor.yaml) verbatim: - requirements.txt: flow-doctor[diagnosis]>=0.3.0,<0.4.0 - log_config.py: FlowDoctor singleton + FlowDoctorHandler at ERROR level, attached to the root logger when FLOW_DOCTOR_ENABLED=1. Import is inside _attach_flow_doctor so local dev without the dep installed still works. - flow-doctor.yaml.example: committed template. Real file is gitignored — will be staged from alpha-engine-config at deploy time (same pattern as predictor.yaml, risk.yaml). Out of scope for this PR (deployment steps — user action required) - Add flow-doctor.yaml to alpha-engine-config repo at path matching what the Step Functions expect. - Verify FLOW_DOCTOR_ENABLED=1 and FLOW_DOCTOR_GITHUB_TOKEN are already in /home/ec2-user/.alpha-engine.env (executor already uses these, so likely yes, but needs confirmation). - First live fire: next daily or Saturday Step Function run — expect an email + issue if any of the new hard-fail paths trigger. Out of scope (tracked) - Same integration for predictor Lambda (different deploy model — needs flow-doctor packaged into the Lambda image or layer). - Same integration for research Lambda. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Addresses the class of failure surfaced 2026-04-14: daily_append silently not writing to ArcticDB for two weekdays because arcticdb wasn't in the deploy image and the import error was swallowed. A preflight check catches this in ~1s instead of letting the pipeline run to "success" with stale data. Pattern D (simplest): inline _preflight() in weekly_collector.py, called from main() after config load, before run_weekly(). No new files, no shared library. If the check pattern proves valuable across the other modules, we can extract a common helper later — but the per-repo checks are small enough (~30 LOC) that inlining is fine for now. Scoped to external-world handshakes (env vars, S3, ArcticDB) — NOT correctness of the collection itself. The hardened collectors from PRs #24 + #25 still own data-integrity hard-fails. Checks by mode - daily: AWS_REGION env, S3 bucket reachable, ArcticDB universe library readable, SPY freshness ≤ 4 days (covers Fri → Tue long weekend + buffer). 4-day stale SPY would have caught today's bug on 2026-04-14 instead of letting Friday's write look healthy until Saturday. - phase1: AWS_REGION + FRED_API_KEY + POLYGON_API_KEY + S3 reachable. - phase2: AWS_REGION + FMP_API_KEY + EDGAR_IDENTITY + S3 reachable. Failures raise RuntimeError. main() already exits 1 on any SystemExit path, and flow-doctor (#26, once deployed) will dispatch the corresponding ERROR log as email + GitHub issue. Out of scope (tracked) - Same pattern in predictor inference + training, research Lambda, executor entrypoints, backtester. Rolling out after this first consumer proves the shape. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Applies the PR #24/#25/#28 pattern to the Saturday RAG ingestion path. The shell script was silently swallowing failures in all 5 pipelines, making "RAG Weekly Ingestion Complete" a lie whenever any step failed. Shell script (rag/pipelines/run_weekly_ingestion.sh) - Removed `|| echo "WARNING: ... (non-fatal)"` from all 5 ingestion steps, the CloudWatch heartbeat, and the completion email. set -e was already active but these swallowers defeated it. - Removed the runtime `if [ -n "$FINNHUB_API_KEY" ]; then ... else echo SKIPPED fi` branch. All required env vars are now hard-failed by preflight (step 0) before any ingestion runs. A silently-skipped earnings transcript step defeats the purpose of having transcripts at all. - Added `Step 0/5: python -m rag.preflight` at the top. - The hardcoded 'status: ok' completion email is now truthful rather than aspirational — with set -e active and no swallowers, reaching the email means all 5 pipelines actually succeeded. New file: rag/preflight.py - RAGPreflight(BasePreflight) subclass — composes check_env_vars (AWS_REGION, VOYAGE_API_KEY, FINNHUB_API_KEY, EDGAR_IDENTITY, RAG_DATABASE_URL) + check_s3_bucket. - main() uses alpha-engine-lib's setup_logging with the shared flow-doctor.yaml path, so a preflight failure fires email + issue via the existing dispatch. rag/db.py - is_available(): log.debug → log.warning for the exception path. The function was otherwise unchanged — it's a non-raising probe for future retrieval-side consumers. Flagged as unused inside alpha-engine-data (zero callers); defer deletion until cross-repo audit completes, since predictor / research / backtester may import from it. rag/pipelines/ingest_8k_filings.py - Per-URL download failure: log.debug → log.warning. Caller still treats None as "skip this filing" (aggregate counts are reported), so no behavior change; the failure rate is just visible now. Dead code flagged (no change in this PR) - rag/db.py::is_available — zero local callers. Keep for now, flag for future cross-repo sweep. Out of scope (tracked) - Adopt alpha-engine-lib setup_logging in each ingestion script's main() for consistent log formatting + flow-doctor capture of per-pipeline errors. Currently only preflight.py uses the lib; ingestion scripts still use Python's default root logger. Minor follow-up. - Date-parsing `except ValueError: continue` patterns in ingest_sec_filings, ingest_8k_filings, ingest_theses, ingest_earnings_transcripts. Reviewed case-by-case — all are legitimate "skip this malformed entry" flows with aggregate counts upstream. Not silent fails. Test plan - [x] pytest tests/ — 41 pass - [x] Syntax check on all modified Python files - [x] bash -n on run_weekly_ingestion.sh - [ ] Next Saturday Step Function run exercises the hardened path. Forced failure test: unset FINNHUB_API_KEY on EC2 and re-run — must fail at preflight (step 0), not silently skip step 3. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

cipher813 merged commit b11379a into main Apr 14, 2026
1 check passed

cipher813 deleted the fix/silent-fail-audit branch April 14, 2026 15:06

cipher813 mentioned this pull request Apr 14, 2026

Integrate flow-doctor for failure alerting #26

Merged

5 tasks

cipher813 mentioned this pull request Apr 14, 2026

Add preflight checks to weekly_collector entrypoint #27

Closed

4 tasks

cipher813 mentioned this pull request Apr 14, 2026

RAG hardening — remove silent fails + add preflight #31

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Convert silent fails in daily production path to hard fails#25

Convert silent fails in daily production path to hard fails#25
cipher813 merged 1 commit into
mainfrom
fix/silent-fail-audit

cipher813 commented Apr 14, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

cipher813 commented Apr 14, 2026

Summary

Changes

Dead code flagged (defer to cleanup PR)

Out of scope (tracked for follow-up audit)

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant