Integrate flow-doctor for failure alerting #26
Merged
Hardened failures from PRs #24 + #25 now raise cleanly up through _run_daily and weekly_collector.main(). The pipeline exits non-zero, but there's no alerting: a 6 AM failure is only visible if someone manually reads CloudWatch / the SSM log. Flow-doctor closes that loop: any ERROR-level log (including traceback-emitting exceptions from logger.exception) fires email + a deduped GitHub issue.

Mirrors the executor's integration (alpha-engine/executor/log_config.py, alpha-engine/flow-doctor.yaml) verbatim:

- requirements.txt: flow-doctor[diagnosis]>=0.3.0,<0.4.0
- log_config.py: FlowDoctor singleton + FlowDoctorHandler at ERROR level, attached to the root logger when FLOW_DOCTOR_ENABLED=1. Import is inside _attach_flow_doctor so local dev without the dep installed still works.
- flow-doctor.yaml.example: committed template. Real file is gitignored and will be staged from alpha-engine-config at deploy time (same pattern as predictor.yaml, risk.yaml).

Out of scope for this PR (deployment steps, user action required)

- Add flow-doctor.yaml to alpha-engine-config repo at path matching what the Step Functions expect.
- Verify FLOW_DOCTOR_ENABLED=1 and FLOW_DOCTOR_GITHUB_TOKEN are already in /home/ec2-user/.alpha-engine.env (executor already uses these, so likely yes, but needs confirmation).
- First live fire: next daily or Saturday Step Function run. Expect an email + issue if any of the new hard-fail paths trigger.

Out of scope (tracked)

- Same integration for predictor Lambda (different deploy model: needs flow-doctor packaged into the Lambda image or layer).
- Same integration for research Lambda.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
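The gating described for log_config.py can be sketched roughly as follows. This is a minimal illustration, not the actual module: `RecordingHandler` and `attach_alert_handler` are hypothetical names, and the stand-in handler only stores records where the real FlowDoctorHandler would dispatch email and a GitHub issue. The factory is called lazily, which plays the role of the import inside _attach_flow_doctor.

```python
import logging
import os


class RecordingHandler(logging.Handler):
    """Hypothetical stand-in for FlowDoctorHandler: it only stores records."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)


def attach_alert_handler(make_handler, env=None, logger=None):
    """Attach an ERROR-level handler to the root logger when alerting is enabled.

    make_handler is only called once the flag is set, so a missing optional
    dependency (ImportError) never breaks plain logging in local dev.
    """
    env = os.environ if env is None else env
    logger = logging.getLogger() if logger is None else logger
    if env.get("FLOW_DOCTOR_ENABLED") != "1":
        return None  # alerting switched off: leave logging untouched
    try:
        handler = make_handler()
    except ImportError:
        return None  # optional dependency not installed
    handler.setLevel(logging.ERROR)  # WARNING and below never reach the handler
    logger.addHandler(handler)
    return handler
```

With the flag set, a `logger.exception(...)` call on any child logger propagates to the root logger and reaches the handler with its traceback attached, while warnings are filtered out.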
cipher813 added a commit that referenced this pull request on Apr 14, 2026
Addresses the class of failure surfaced 2026-04-14: daily_append silently not writing to ArcticDB for two weekdays because arcticdb wasn't in the deploy image and the import error was swallowed. A preflight check catches this in ~1s instead of letting the pipeline run to "success" with stale data.

Pattern D (simplest): inline _preflight() in weekly_collector.py, called from main() after config load, before run_weekly(). No new files, no shared library. If the check pattern proves valuable across the other modules, we can extract a common helper later, but the per-repo checks are small enough (~30 LOC) that inlining is fine for now.

Scoped to external-world handshakes (env vars, S3, ArcticDB), NOT correctness of the collection itself. The hardened collectors from PRs #24 + #25 still own data-integrity hard-fails.

Checks by mode

- daily: AWS_REGION env, S3 bucket reachable, ArcticDB universe library readable, SPY freshness ≤ 4 days (covers Fri → Tue long weekend + buffer). 4-day stale SPY would have caught today's bug on 2026-04-14 instead of letting Friday's write look healthy until Saturday.
- phase1: AWS_REGION + FRED_API_KEY + POLYGON_API_KEY + S3 reachable.
- phase2: AWS_REGION + FMP_API_KEY + EDGAR_IDENTITY + S3 reachable.

Failures raise RuntimeError. main() already exits 1 on any SystemExit path, and flow-doctor (#26, once deployed) will dispatch the corresponding ERROR log as email + GitHub issue.

Out of scope (tracked)

- Same pattern in predictor inference + training, research Lambda, executor entrypoints, backtester. Rolling out after this first consumer proves the shape.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
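A minimal sketch of the env-var and freshness parts of such a preflight, assuming the per-mode variable lists above. The names `check_env` and `check_freshness` are hypothetical, and the S3 and ArcticDB reachability checks are left out because they need live services.

```python
import os
from datetime import date

# Required environment variables per mode, as listed in the commit message.
REQUIRED_ENV = {
    "daily": ["AWS_REGION"],
    "phase1": ["AWS_REGION", "FRED_API_KEY", "POLYGON_API_KEY"],
    "phase2": ["AWS_REGION", "FMP_API_KEY", "EDGAR_IDENTITY"],
}
MAX_SPY_AGE_DAYS = 4  # Fri -> Tue long weekend plus buffer


def check_env(mode, env=None):
    """Raise RuntimeError if any variable required for this mode is unset or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_ENV[mode] if not env.get(name)]
    if missing:
        raise RuntimeError(f"Pre-flight failed (mode={mode}): missing env vars {missing}")


def check_freshness(last_date, today, max_age_days=MAX_SPY_AGE_DAYS):
    """Raise RuntimeError if the latest stored row is older than the allowed age."""
    age = (today - last_date).days
    if age > max_age_days:
        raise RuntimeError(f"Pre-flight failed: SPY is {age} days stale (limit {max_age_days})")
```

Under this sketch a last row dated Friday 2026-04-10 still passes on Tuesday 2026-04-14 (exactly 4 days), while a last row from Thursday 2026-04-09 fails.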
4 tasks
cipher813 added a commit that referenced this pull request on Apr 14, 2026
Replaces the inline log_config.py (committed 2026-04-14 in PR #26) and the proposed-then-closed inline _preflight (PR #27) with the shared library. alpha-engine-lib is now the single source of truth for these cross-cutting concerns; drift across consumer repos becomes impossible by construction.

Changes

- requirements.txt: add alpha-engine-lib @ git+…@v0.1.0 with [arcticdb,flow_doctor] extras. Drop the direct flow-doctor pin, which is now pulled in transitively by the extra.
- preflight.py: new module with DataPreflight(BasePreflight). Composes mode-specific check sequences (daily / phase1 / phase2) on top of the shared primitives. ~40 LOC.
- weekly_collector.py:
  - Import setup_logging from alpha_engine_lib.logging (was local log_config). Pass flow-doctor.yaml path explicitly since the lib version is path-parametric (each consumer has its own yaml).
  - Call DataPreflight(...).run() at the top of main(), after config load, before run_weekly().
- log_config.py: deleted. The lib version is now the sole copy.

Test plan

- [x] pytest tests/: 41 pass
- [x] Import smoke: `from preflight import DataPreflight` works against locally-installed alpha-engine-lib v0.1.0
- [ ] EC2: pip install -r requirements.txt must succeed. The git+ URL requires the existing ~/.netrc PAT to have Contents: read on cipher813/alpha-engine-lib. If it doesn't, the install fails loud and visibly (no silent fall-through).
- [ ] Next weekday DailyData run: confirm "Pre-flight OK (mode=daily)" appears before "COLLECTING: daily closes".
- [ ] Forced failure test: `FRED_API_KEY= python weekly_collector.py --phase 1 --dry-run` should raise at preflight.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
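The composition described for preflight.py might look something like the sketch below. The real BasePreflight lives in alpha-engine-lib and its API is not shown in this thread, so the base class here is a hypothetical stand-in with a single assumed primitive (`check_env_vars`); only the class names, the modes, the variable names, and the "Pre-flight OK" message come from the commit message.

```python
class BasePreflight:
    """Hypothetical stand-in for the shared base class; not the real lib API."""

    def __init__(self, env):
        self.env = env

    def check_env_vars(self, *names):
        # Shared primitive: fail loudly on any unset or empty variable.
        missing = [n for n in names if not self.env.get(n)]
        if missing:
            raise RuntimeError(f"Pre-flight failed: missing env vars {missing}")


class DataPreflight(BasePreflight):
    """Composes a mode-specific check sequence on top of the shared primitive."""

    def __init__(self, mode, env):
        super().__init__(env)
        self.mode = mode

    def run(self):
        self.check_env_vars("AWS_REGION")  # required in every mode
        if self.mode == "phase1":
            self.check_env_vars("FRED_API_KEY", "POLYGON_API_KEY")
        elif self.mode == "phase2":
            self.check_env_vars("FMP_API_KEY", "EDGAR_IDENTITY")
        elif self.mode != "daily":
            raise ValueError(f"unknown preflight mode: {self.mode}")
        return f"Pre-flight OK (mode={self.mode})"
```

Passing the environment in as a mapping keeps the sketch testable without touching `os.environ`; the forced-failure item in the test plan corresponds to running the phase1 sequence with an empty FRED_API_KEY.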
5 tasks
cipher813 added a commit that referenced this pull request on Apr 15, 2026
…rce)

Root cause fix for the 2026-04-15 predictor retrain outage where 904/909 tickers in the universe library had duplicate date rows when read back from ArcticDB. That failure was worked around defensively in the predictor loader (alpha-engine-predictor PR #26) and the inference loader has had equivalent defensive dedup for some time (load_prices.py line 403). This PR fixes the accumulation at the write site.

Change: swap `lib.append(symbol, today_row)` for `lib.update(symbol, today_row)` at the three daily write sites in builders/daily_append.py:

- universe_lib.update(ticker, today_row) (line 251)
- macro_lib.update(key, new_row) (line 269)
- macro_lib.update(sym, new_row) (line 286)

append() adds rows without dedup: if daily_append runs twice for the same date (race, retry, concurrent Saturday+Sunday pipelines), rows accumulate. update() is idempotent: ArcticDB replaces any existing rows whose dates overlap with the input DataFrame, so a re-run with the same or updated row produces at most one row per date regardless of invocation count.

The read-check at line 195 (if today_ts in hist.index: skip) stays; it's an efficiency guard that avoids the write entirely when the row already exists. update() is the safety net when that check misses.

tests/test_daily_append_semantics.py: source-level regression guards against a future revert to append() on any of the three sites.

Follow-up: once this has been in place for 1-2 full Saturday cycles, remove the defensive dedup in alpha-engine-predictor data/dataset.py (`_load_ticker_parquet`). Track on ROADMAP under Data Platform.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
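The difference between the two write semantics can be illustrated with a toy in-memory model. This is not ArcticDB and does not use its API; `ToySeriesStore` is a hypothetical class that only mimics the behavior described above, with rows held as (date, value) tuples.

```python
from datetime import date


class ToySeriesStore:
    """In-memory toy model of a date-indexed symbol store (not ArcticDB)."""

    def __init__(self):
        self.rows = {}  # symbol -> list of (date, value), sorted by date

    def append(self, symbol, new_rows):
        # Blindly adds rows: a repeated call duplicates the date.
        self.rows.setdefault(symbol, []).extend(new_rows)

    def update(self, symbol, new_rows):
        # Replaces every existing row that falls inside the date range of new_rows.
        start = min(d for d, _ in new_rows)
        end = max(d for d, _ in new_rows)
        kept = [(d, v) for d, v in self.rows.get(symbol, []) if not start <= d <= end]
        self.rows[symbol] = sorted(kept + list(new_rows))


today_row = [(date(2026, 4, 15), 101.5)]

appended = ToySeriesStore()
appended.append("SPY", today_row)
appended.append("SPY", today_row)  # simulated retry: the date is now stored twice

updated = ToySeriesStore()
updated.update("SPY", today_row)
updated.update("SPY", today_row)  # simulated retry: still one row for the date
```

In this model the retried append leaves two rows for 2026-04-15, while the retried update leaves exactly one, which is the idempotency property the change relies on.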
4 tasks
Summary
PRs #24 + #25 hardened the alpha-engine-data daily production path so failures raise and exit non-zero. But there's no alerting — a 6 AM failure is only visible if someone manually reads CloudWatch / the SSM log. Flow-doctor closes that loop: any ERROR-level log (including traceback-emitting exceptions from `logger.exception`) fires email + a deduped GitHub issue.
Mirrors the executor's integration (`alpha-engine/executor/log_config.py`, `alpha-engine/flow-doctor.yaml`) verbatim.
Changes
Deployment steps (user action — out of scope for this PR)
Out of scope (tracked for follow-up)
Test plan
🤖 Generated with Claude Code