Integrate flow-doctor for failure alerting by cipher813 · Pull Request #26 · cipher813/alpha-engine-data

cipher813 · 2026-04-14T15:12:10Z

Summary

PRs #24 + #25 hardened the alpha-engine-data daily production path so failures raise and exit non-zero. But there's no alerting — a 6 AM failure is only visible if someone manually reads CloudWatch / the SSM log. Flow-doctor closes that loop: any ERROR-level log (including traceback-emitting exceptions from `logger.exception`) fires email + a deduped GitHub issue.

Mirrors the executor's integration (`alpha-engine/executor/log_config.py`, `alpha-engine/flow-doctor.yaml`) verbatim.

Changes

`requirements.txt` — `flow-doctor[diagnosis]>=0.3.0,<0.4.0`
`log_config.py` — `FlowDoctor` singleton + `FlowDoctorHandler` at ERROR level attached to root logger when `FLOW_DOCTOR_ENABLED=1`. Lazy import so local dev without the dep installed still works.
`flow-doctor.yaml.example` — committed template. Points at `cipher813/alpha-engine-data` for GitHub issues and uses the same shared SMTP env vars as the executor config.
`.gitignore` — `flow-doctor.yaml` ignored (real file will be staged from `alpha-engine-config`).
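The env-gated, lazy-import attachment described for `log_config.py` can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `_StandInAlertHandler` and `attach_alerting` are hypothetical names, the stand-in handler only collects records instead of calling flow-doctor, and the warn-and-continue behavior when the dependency is missing is an assumption.

```python
import importlib
import logging
import os


class _StandInAlertHandler(logging.Handler):
    """Hypothetical stand-in for FlowDoctorHandler; it only collects records."""

    def __init__(self):
        # Only ERROR and above reach emit().
        super().__init__(level=logging.ERROR)
        self.records = []

    def emit(self, record):
        self.records.append(record)


def attach_alerting(logger, module_name="flow_doctor"):
    """Attach an ERROR-level alert handler when FLOW_DOCTOR_ENABLED=1.

    Returns the handler, or None when alerting is disabled or the
    dependency cannot be imported (assumed behavior for this sketch).
    """
    if os.environ.get("FLOW_DOCTOR_ENABLED") != "1":
        return None  # no-op for local dev; the import below never runs

    try:
        # Lazy import: the dependency is only touched when alerting is on.
        importlib.import_module(module_name)
    except ImportError:
        logger.warning("%s not installed; alerting disabled", module_name)
        return None

    handler = _StandInAlertHandler()
    logger.addHandler(handler)
    return handler
```

Because the handler level is ERROR, a `logger.exception(...)` call inside an `except` block reaches it with the traceback attached (`record.exc_info`), while INFO and WARNING records do not.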

Deployment steps (user action — out of scope for this PR)

Add `flow-doctor.yaml` to `alpha-engine-config` repo at a path the deploy pipeline can stage from.
Confirm `FLOW_DOCTOR_ENABLED=1` and `FLOW_DOCTOR_GITHUB_TOKEN` are already in `/home/ec2-user/.alpha-engine.env`. The executor uses the same vars, so likely yes — needs confirmation.
First live fire: next daily or Saturday Step Function run. If any of the hard-fail paths added in PRs #24 (Hard-fail daily_append + propagate exit code in weekday SSM) and #25 (Convert silent fails in daily production path to hard fails) trigger, expect an email + an issue on this repo.

Out of scope (tracked for follow-up)

Same integration for predictor Lambda (different deploy model — flow-doctor needs to be packaged into the Lambda image or a layer).
Same integration for research Lambda.

Test plan

`pytest tests/ --ignore=tests/integration -q` — 41 passed
`python -c 'from log_config import setup_logging; setup_logging("test")'` — no-op when `FLOW_DOCTOR_ENABLED` unset, matches local-dev expectation
On EC2, install flow-doctor after the PR merges (`pip install -r requirements.txt`)
Stage `flow-doctor.yaml` from alpha-engine-config
Next pipeline run — confirm flow-doctor init logs at INFO; no spurious errors on healthy run; a forced failure (e.g., `aws s3 rm` the daily_closes parquet) produces an email + issue

🤖 Generated with Claude Code

Hardened failures from PRs #24 + #25 now raise cleanly up through _run_daily and weekly_collector.main(). The pipeline exits non-zero, but there's no alerting: a 6 AM failure is only visible if someone manually reads CloudWatch / the SSM log. Flow-doctor closes that loop: any ERROR-level log (including traceback-emitting exceptions from logger.exception) fires email + a deduped GitHub issue.

Mirrors the executor's integration (alpha-engine/executor/log_config.py, alpha-engine/flow-doctor.yaml) verbatim:

- requirements.txt: flow-doctor[diagnosis]>=0.3.0,<0.4.0
- log_config.py: FlowDoctor singleton + FlowDoctorHandler at ERROR level, attached to the root logger when FLOW_DOCTOR_ENABLED=1. Import is inside _attach_flow_doctor so local dev without the dep installed still works.
- flow-doctor.yaml.example: committed template. Real file is gitignored and will be staged from alpha-engine-config at deploy time (same pattern as predictor.yaml, risk.yaml).

Out of scope for this PR (deployment steps, user action required)

- Add flow-doctor.yaml to alpha-engine-config repo at path matching what the Step Functions expect.
- Verify FLOW_DOCTOR_ENABLED=1 and FLOW_DOCTOR_GITHUB_TOKEN are already in /home/ec2-user/.alpha-engine.env (executor already uses these, so likely yes, but needs confirmation).
- First live fire: next daily or Saturday Step Function run. Expect an email + issue if any of the new hard-fail paths trigger.

Out of scope (tracked)

- Same integration for predictor Lambda (different deploy model: needs flow-doctor packaged into the Lambda image or layer).
- Same integration for research Lambda.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Addresses the class of failure surfaced 2026-04-14: daily_append silently not writing to ArcticDB for two weekdays because arcticdb wasn't in the deploy image and the import error was swallowed. A preflight check catches this in ~1s instead of letting the pipeline run to "success" with stale data.

Pattern D (simplest): inline _preflight() in weekly_collector.py, called from main() after config load, before run_weekly(). No new files, no shared library. If the check pattern proves valuable across the other modules, we can extract a common helper later, but the per-repo checks are small enough (~30 LOC) that inlining is fine for now.

Scoped to external-world handshakes (env vars, S3, ArcticDB), NOT correctness of the collection itself. The hardened collectors from PRs #24 + #25 still own data-integrity hard-fails.

Checks by mode

- daily: AWS_REGION env, S3 bucket reachable, ArcticDB universe library readable, SPY freshness ≤ 4 days (covers Fri → Tue long weekend + buffer). 4-day stale SPY would have caught today's bug on 2026-04-14 instead of letting Friday's write look healthy until Saturday.
- phase1: AWS_REGION + FRED_API_KEY + POLYGON_API_KEY + S3 reachable.
- phase2: AWS_REGION + FMP_API_KEY + EDGAR_IDENTITY + S3 reachable.

Failures raise RuntimeError. main() already exits 1 on any SystemExit path, and flow-doctor (#26, once deployed) will dispatch the corresponding ERROR log as email + GitHub issue.

Out of scope (tracked)

- Same pattern in predictor inference + training, research Lambda, executor entrypoints, backtester. Rolling out after this first consumer proves the shape.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
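The env-var and freshness parts of the per-mode checks above might look something like the sketch below. It is an illustration under assumptions, not the inline `_preflight()` from the PR: the function names and the `REQUIRED_ENV` table are hypothetical, the S3 and ArcticDB reachability checks are left out so the example runs without AWS access, and treating an empty value as missing is an assumption.

```python
import os
from datetime import date

# Required env vars per mode, taken from the "Checks by mode" list.
REQUIRED_ENV = {
    "daily": ["AWS_REGION"],
    "phase1": ["AWS_REGION", "FRED_API_KEY", "POLYGON_API_KEY"],
    "phase2": ["AWS_REGION", "FMP_API_KEY", "EDGAR_IDENTITY"],
}
MAX_SPY_STALENESS_DAYS = 4  # Fri -> Tue long weekend plus buffer


def check_env(mode, env=None):
    """Raise RuntimeError if any env var required for `mode` is unset or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_ENV[mode] if not env.get(name)]
    if missing:
        raise RuntimeError(
            f"Pre-flight failed (mode={mode}): missing env vars {missing}"
        )


def check_freshness(last_row_date, today, max_days=MAX_SPY_STALENESS_DAYS):
    """Raise RuntimeError if the newest stored row is older than `max_days`."""
    age_days = (today - last_row_date).days
    if age_days > max_days:
        raise RuntimeError(
            f"Pre-flight failed: latest row is {age_days} days old "
            f"(limit {max_days})"
        )
```

With these definitions, a last SPY row dated 2026-04-10 checked on 2026-04-14 is 4 days old and passes, while a last row dated 2026-04-09 is 5 days old and raises.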

Replaces the inline log_config.py (committed 2026-04-14 in PR #26) and the proposed-then-closed inline _preflight (PR #27) with the shared library. alpha-engine-lib is now the single source of truth for these cross-cutting concerns; drift across consumer repos becomes impossible by construction.

Changes

- requirements.txt: add alpha-engine-lib @ git+…@v0.1.0 with [arcticdb,flow_doctor] extras. Drop the direct flow-doctor pin, which is now pulled in transitively by the extra.
- preflight.py: new module with DataPreflight(BasePreflight). Composes mode-specific check sequences (daily / phase1 / phase2) on top of the shared primitives. ~40 LOC.
- weekly_collector.py:
  - Import setup_logging from alpha_engine_lib.logging (was local log_config). Pass flow-doctor.yaml path explicitly since the lib version is path-parametric (each consumer has its own yaml).
  - Call DataPreflight(...).run() at the top of main(), after config load, before run_weekly().
- log_config.py: deleted. The lib version is now the sole copy.

Test plan

- [x] pytest tests/ (41 pass)
- [x] Import smoke: `from preflight import DataPreflight` works against locally-installed alpha-engine-lib v0.1.0
- [ ] EC2: pip install -r requirements.txt must succeed. The git+ URL requires the existing ~/.netrc PAT to have Contents: read on cipher813/alpha-engine-lib. If it doesn't, the install fails loud and visibly (no silent fall-through).
- [ ] Next weekday DailyData run: confirm "Pre-flight OK (mode=daily)" appears before "COLLECTING: daily closes".
- [ ] Forced failure test: `FRED_API_KEY= python weekly_collector.py --phase 1 --dry-run` should raise at preflight.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…rce)

Root cause fix for the 2026-04-15 predictor retrain outage where 904/909 tickers in the universe library had duplicate date rows when read back from ArcticDB. That failure was worked around defensively in the predictor loader (alpha-engine-predictor PR #26) and the inference loader has had equivalent defensive dedup for some time (load_prices.py line 403). This PR fixes the accumulation at the write site.

Change: swap `lib.append(symbol, today_row)` for `lib.update(symbol, today_row)` at the three daily write sites in builders/daily_append.py:

- universe_lib.update(ticker, today_row) (line 251)
- macro_lib.update(key, new_row) (line 269)
- macro_lib.update(sym, new_row) (line 286)

append() adds rows without dedup: if daily_append runs twice for the same date (race, retry, concurrent Saturday+Sunday pipelines), rows accumulate. update() is idempotent: ArcticDB replaces any existing rows whose dates overlap with the input DataFrame, so a re-run with the same or updated row produces at most one row per date regardless of invocation count.

The read-check at line 195 (if today_ts in hist.index: skip) stays. It's an efficiency guard that avoids the write entirely when the row already exists. update() is the safety net when that check misses.

tests/test_daily_append_semantics.py: source-level regression guards against a future revert to append() on any of the three sites.

Follow-up: once this has been in place for 1-2 full Saturday cycles, remove the defensive dedup in alpha-engine-predictor data/dataset.py (`_load_ticker_parquet`). Track on ROADMAP under Data Platform.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
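The difference between the two write modes can be illustrated with a small pure-Python model. This is a sketch of the semantics described above, not ArcticDB itself: `append_rows` and `update_rows` are hypothetical helpers operating on lists of `(date, value)` tuples, and the model assumes each write carries unique dates.

```python
from datetime import date


def append_rows(stored, new_rows):
    """Model of append(): rows are added with no dedup."""
    return stored + new_rows


def update_rows(stored, new_rows):
    """Model of update(): stored rows inside the input's date range are replaced."""
    if not new_rows:
        return list(stored)
    lo = min(d for d, _ in new_rows)
    hi = max(d for d, _ in new_rows)
    # Keep only stored rows outside the date range covered by the input.
    kept = [(d, v) for d, v in stored if d < lo or d > hi]
    return sorted(kept + new_rows)


history = [(date(2026, 4, 13), 100.0)]
today_row = [(date(2026, 4, 14), 101.0)]

# A retry with append() leaves two rows for the same date.
appended_twice = append_rows(append_rows(history, today_row), today_row)

# A retry with update() still leaves one row per date.
updated_twice = update_rows(update_rows(history, today_row), today_row)
```

In this model `appended_twice` holds three rows, two of them dated 2026-04-14, while `updated_twice` holds two rows. A later `update_rows` call with a corrected value for the same date replaces the earlier row rather than adding another.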


cipher813 merged commit 1e98ccb into main Apr 14, 2026
1 check passed

cipher813 deleted the feat/flow-doctor-integration branch April 14, 2026 15:14

cipher813 mentioned this pull request Apr 14, 2026

Add preflight checks to weekly_collector entrypoint #27

Closed

4 tasks

cipher813 mentioned this pull request Apr 14, 2026

Adopt alpha-engine-lib v0.1.0 for preflight + logging #28

Merged

5 tasks

cipher813 mentioned this pull request Apr 15, 2026

daily_append: use update() instead of append() (dedup at source) #35

Merged

4 tasks

cipher813 merged 1 commit into main from feat/flow-doctor-integration
