Integrate flow-doctor for failure alerting #26

Merged
cipher813 merged 1 commit into main from feat/flow-doctor-integration
Apr 14, 2026

@cipher813
Owner

Summary

PRs #24 + #25 hardened the alpha-engine-data daily production path so failures raise and exit non-zero. But there's no alerting — a 6 AM failure is only visible if someone manually reads CloudWatch / the SSM log. Flow-doctor closes that loop: any ERROR-level log (including traceback-emitting exceptions from `logger.exception`) fires email + a deduped GitHub issue.

Mirrors the executor's integration (`alpha-engine/executor/log_config.py`, `alpha-engine/flow-doctor.yaml`) verbatim.

Changes

  • `requirements.txt` — `flow-doctor[diagnosis]>=0.3.0,<0.4.0`
  • `log_config.py` — `FlowDoctor` singleton + `FlowDoctorHandler` at ERROR level attached to root logger when `FLOW_DOCTOR_ENABLED=1`. Lazy import so local dev without the dep installed still works.
  • `flow-doctor.yaml.example` — committed template. Points at `cipher813/alpha-engine-data` for GitHub issues and uses the same shared SMTP env vars as the executor config.
  • `.gitignore` — `flow-doctor.yaml` ignored (real file will be staged from `alpha-engine-config`).
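
The `log_config.py` change can be pictured roughly as follows. This is a minimal sketch, not the committed module: `CollectingAlertHandler` and the `handler_factory` parameter are hypothetical stand-ins, because the `FlowDoctorHandler` constructor is not shown in this PR. It only illustrates the gating on `FLOW_DOCTOR_ENABLED=1` and an ERROR-level handler attached to the root logger.

```python
import logging
import os


class CollectingAlertHandler(logging.Handler):
    """Hypothetical stand-in for the real alert handler.

    It records formatted messages instead of sending email or opening issues.
    """

    def __init__(self):
        super().__init__(level=logging.ERROR)
        self.alerts = []

    def emit(self, record):
        # format() appends the traceback when the record carries exc_info,
        # which is the case for logger.exception(...).
        self.alerts.append(self.format(record))


def setup_logging(name, handler_factory=CollectingAlertHandler):
    """Configure logging; attach the alert handler only when enabled."""
    logging.basicConfig(level=logging.INFO)
    if os.environ.get("FLOW_DOCTOR_ENABLED") != "1":
        return None  # local dev: no alerting, nothing extra is imported
    # The real module would import flow-doctor at this point, inside the
    # enabled branch, so machines without the dependency never hit the import.
    handler = handler_factory()
    handler.setLevel(logging.ERROR)
    logging.getLogger().addHandler(handler)
    return handler
```

With the variable unset, `setup_logging("test")` returns `None` and attaches nothing, which matches the local-dev expectation in the test plan.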

Deployment steps (user action — out of scope for this PR)

  1. Add `flow-doctor.yaml` to `alpha-engine-config` repo at a path the deploy pipeline can stage from.
  2. Confirm `FLOW_DOCTOR_ENABLED=1` and `FLOW_DOCTOR_GITHUB_TOKEN` are already in `/home/ec2-user/.alpha-engine.env`. The executor uses the same vars, so likely yes — needs confirmation.
  3. First live fire: next daily or Saturday Step Function run. If any of the hard-fail paths added in PRs #24 + #25 trigger, expect an email + an issue on this repo.

Out of scope (tracked for follow-up)

  • Same integration for predictor Lambda (different deploy model — flow-doctor needs to be packaged into the Lambda image or a layer).
  • Same integration for research Lambda.

Test plan

  • `pytest tests/ --ignore=tests/integration -q` — 41 passed
  • `python -c 'from log_config import setup_logging; setup_logging("test")'` — no-op when `FLOW_DOCTOR_ENABLED` unset, matches local-dev expectation
  • EC2 install flow-doctor after PR merge (`pip install -r requirements.txt`)
  • Stage `flow-doctor.yaml` from alpha-engine-config
  • Next pipeline run — confirm flow-doctor init logs at INFO; no spurious errors on healthy run; a forced failure (e.g., `aws s3 rm` the daily_closes parquet) produces an email + issue

🤖 Generated with Claude Code

Hardened failures from PRs #24 + #25 now raise cleanly up through
_run_daily and weekly_collector.main(). The pipeline exits non-zero,
but there's no alerting — a 6 AM failure is only visible if someone
manually reads CloudWatch / the SSM log. Flow-doctor closes that loop:
any ERROR-level log (including traceback-emitting exceptions from
logger.exception) fires email + a deduped GitHub issue.

Mirrors the executor's integration (alpha-engine/executor/log_config.py,
alpha-engine/flow-doctor.yaml) verbatim:

- requirements.txt: flow-doctor[diagnosis]>=0.3.0,<0.4.0
- log_config.py: FlowDoctor singleton + FlowDoctorHandler at ERROR level,
  attached to the root logger when FLOW_DOCTOR_ENABLED=1. Import is
  inside _attach_flow_doctor so local dev without the dep installed
  still works.
- flow-doctor.yaml.example: committed template. Real file is gitignored
  — will be staged from alpha-engine-config at deploy time (same pattern
  as predictor.yaml, risk.yaml).

Out of scope for this PR (deployment steps — user action required)
- Add flow-doctor.yaml to alpha-engine-config repo at path matching what
  the Step Functions expect.
- Verify FLOW_DOCTOR_ENABLED=1 and FLOW_DOCTOR_GITHUB_TOKEN are already
  in /home/ec2-user/.alpha-engine.env (executor already uses these, so
  likely yes, but needs confirmation).
- First live fire: next daily or Saturday Step Function run — expect an
  email + issue if any of the new hard-fail paths trigger.

Out of scope (tracked)
- Same integration for predictor Lambda (different deploy model — needs
  flow-doctor packaged into the Lambda image or layer).
- Same integration for research Lambda.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cipher813 merged commit 1e98ccb into main on Apr 14, 2026
1 check passed
cipher813 deleted the feat/flow-doctor-integration branch on April 14, 2026 at 15:14
cipher813 added a commit that referenced this pull request Apr 14, 2026
Addresses the class of failure surfaced 2026-04-14: daily_append
silently not writing to ArcticDB for two weekdays because arcticdb
wasn't in the deploy image and the import error was swallowed. A
preflight check catches this in ~1s instead of letting the pipeline
run to "success" with stale data.

Pattern D (simplest): inline _preflight() in weekly_collector.py,
called from main() after config load, before run_weekly(). No new
files, no shared library. If the check pattern proves valuable across
the other modules, we can extract a common helper later — but the
per-repo checks are small enough (~30 LOC) that inlining is fine for
now.

Scoped to external-world handshakes (env vars, S3, ArcticDB) — NOT
correctness of the collection itself. The hardened collectors from
PRs #24 + #25 still own data-integrity hard-fails.

Checks by mode
- daily: AWS_REGION env, S3 bucket reachable, ArcticDB universe library
  readable, SPY freshness ≤ 4 days (covers Fri → Tue long weekend +
  buffer). 4-day stale SPY would have caught today's bug on 2026-04-14
  instead of letting Friday's write look healthy until Saturday.
- phase1: AWS_REGION + FRED_API_KEY + POLYGON_API_KEY + S3 reachable.
- phase2: AWS_REGION + FMP_API_KEY + EDGAR_IDENTITY + S3 reachable.
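
The SPY freshness rule in the daily mode is plain date arithmetic. The sketch below assumes the last SPY date has already been read from ArcticDB; `check_spy_freshness` and `MAX_SPY_STALENESS_DAYS` are illustrative names, not the ones used in weekly_collector.py.

```python
from datetime import date

MAX_SPY_STALENESS_DAYS = 4  # threshold quoted in the daily checks above


def check_spy_freshness(last_spy_date: date, today: date) -> int:
    """Return the age in days, or raise if SPY is older than the limit."""
    age_days = (today - last_spy_date).days
    if age_days > MAX_SPY_STALENESS_DAYS:
        raise RuntimeError(
            f"Pre-flight failed: SPY last written {last_spy_date}, "
            f"{age_days} days before {today} (limit {MAX_SPY_STALENESS_DAYS})"
        )
    return age_days
```

A Friday write checked on the following Tuesday is exactly 4 days old and passes. One more missed day pushes the age to 5 and the check raises.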

Failures raise RuntimeError. main() already exits 1 on any SystemExit
path, and flow-doctor (#26, once deployed) will dispatch the
corresponding ERROR log as email + GitHub issue.

Out of scope (tracked)
- Same pattern in predictor inference + training, research Lambda,
  executor entrypoints, backtester. Rolling out after this first
  consumer proves the shape.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cipher813 added a commit that referenced this pull request Apr 14, 2026
Replaces the inline log_config.py (committed 2026-04-14 in PR #26) and
the proposed-then-closed inline _preflight (PR #27) with the shared
library. alpha-engine-lib is now the single source of truth for these
cross-cutting concerns; drift across consumer repos becomes impossible
by construction.

Changes
- requirements.txt: add alpha-engine-lib @ git+…@v0.1.0 with
  [arcticdb,flow_doctor] extras. Drop the direct flow-doctor pin —
  pulled in transitively by the extra.
- preflight.py: new module with DataPreflight(BasePreflight). Composes
  mode-specific check sequences (daily / phase1 / phase2) on top of
  the shared primitives. ~40 LOC.
- weekly_collector.py:
  - Import setup_logging from alpha_engine_lib.logging (was local
    log_config). Pass flow-doctor.yaml path explicitly since the lib
    version is path-parametric (each consumer has its own yaml).
  - Call DataPreflight(...).run() at the top of main(), after config
    load, before run_weekly().
- log_config.py: deleted. The lib version is now the sole copy.
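
For orientation, the composition described for preflight.py might look like the sketch below. The `BasePreflight` here is a hypothetical stand-in written for this example; the real class lives in alpha-engine-lib and its interface is not shown in this commit. Only the env-var checks are modelled, and the S3 and ArcticDB checks are left out.

```python
import os
from typing import Callable, List, Mapping, Optional

Check = Callable[[], None]


class BasePreflight:
    """Hypothetical stand-in for the shared base class."""

    def __init__(self, environ: Optional[Mapping[str, str]] = None):
        self.environ = os.environ if environ is None else environ

    def require_env(self, name: str) -> None:
        # Empty strings count as missing, same as an unset variable.
        if not self.environ.get(name):
            raise RuntimeError(f"Pre-flight failed: env var {name} is not set")

    def checks(self) -> List[Check]:
        raise NotImplementedError

    def run(self) -> None:
        for check in self.checks():
            check()  # the first failing check raises and stops the run


class DataPreflight(BasePreflight):
    """Mode-specific check sequences built from the shared primitives."""

    ENV_BY_MODE = {
        "daily": ["AWS_REGION"],
        "phase1": ["AWS_REGION", "FRED_API_KEY", "POLYGON_API_KEY"],
        "phase2": ["AWS_REGION", "FMP_API_KEY", "EDGAR_IDENTITY"],
    }

    def __init__(self, mode: str, environ: Optional[Mapping[str, str]] = None):
        super().__init__(environ)
        if mode not in self.ENV_BY_MODE:
            raise ValueError(f"unknown preflight mode: {mode}")
        self.mode = mode

    def checks(self) -> List[Check]:
        # Bind each name as a default argument so every lambda keeps its own.
        return [lambda n=name: self.require_env(n) for name in self.ENV_BY_MODE[self.mode]]
```

Under these assumptions, `DataPreflight("phase1", environ).run()` with an empty `FRED_API_KEY` raises before any collection starts, which is the behaviour the forced-failure item in the test plan looks for.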

Test plan
- [x] pytest tests/ — 41 pass
- [x] Import smoke: `from preflight import DataPreflight` works against
  locally-installed alpha-engine-lib v0.1.0
- [ ] EC2: pip install -r requirements.txt must succeed. The git+ URL
  requires the existing ~/.netrc PAT to have Contents: read on
  cipher813/alpha-engine-lib. If it doesn't, the install fails loud
  and visibly (no silent fall-through).
- [ ] Next weekday DailyData run: confirm "Pre-flight OK (mode=daily)"
  appears before "COLLECTING: daily closes".
- [ ] Forced failure test: `FRED_API_KEY= python weekly_collector.py
  --phase 1 --dry-run` should raise at preflight.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cipher813 added a commit that referenced this pull request Apr 15, 2026
…rce)

Root cause fix for the 2026-04-15 predictor retrain outage where
904/909 tickers in the universe library had duplicate date rows when
read back from ArcticDB. That failure was worked around defensively in
the predictor loader (alpha-engine-predictor PR #26) and the inference
loader has had equivalent defensive dedup for some time (load_prices.py
line 403). This PR fixes the accumulation at the write site.

Change: swap `lib.append(symbol, today_row)` for `lib.update(symbol,
today_row)` at the three daily write sites in builders/daily_append.py:
  - universe_lib.update(ticker, today_row)  (line 251)
  - macro_lib.update(key, new_row)           (line 269)
  - macro_lib.update(sym, new_row)           (line 286)

append() adds rows without dedup — if daily_append runs twice for the
same date (race, retry, concurrent Saturday+Sunday pipelines), rows
accumulate. update() is idempotent: ArcticDB replaces any existing
rows whose dates overlap with the input DataFrame, so a re-run with
the same or updated row produces at most one row per date regardless
of invocation count.

The read-check at line 195 (if today_ts in hist.index: skip) stays —
it's an efficiency guard that avoids the write entirely when the row
already exists. update() is the safety net when that check misses.
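
The difference between the two write modes can be shown with a toy model. The functions below are not ArcticDB calls; `toy_append` and `toy_update` are hypothetical helpers over plain lists of (date, value) rows that mimic only the behaviour described above: append adds rows as-is, update replaces whatever falls inside the date range of the incoming rows.

```python
from datetime import date
from typing import List, Tuple

Row = Tuple[date, float]


def toy_append(stored: List[Row], new_rows: List[Row]) -> List[Row]:
    """Add rows without any dedup, so a repeated run accumulates copies."""
    return stored + new_rows


def toy_update(stored: List[Row], new_rows: List[Row]) -> List[Row]:
    """Replace stored rows whose dates overlap the incoming date range."""
    lo = min(d for d, _ in new_rows)
    hi = max(d for d, _ in new_rows)
    kept = [row for row in stored if not (lo <= row[0] <= hi)]
    return sorted(kept + new_rows)


history = [(date(2026, 4, 13), 1.0)]
today_row = [(date(2026, 4, 14), 2.0)]

# Simulate daily_append running twice for the same date.
appended = toy_append(toy_append(history, today_row), today_row)
updated = toy_update(toy_update(history, today_row), today_row)
```

In this model `appended` ends up with two rows for 2026-04-14, while `updated` holds one row per date no matter how many times the write is repeated.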

tests/test_daily_append_semantics.py — source-level regression guards
against a future revert to append() on any of the three sites.
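
One way such a source-level guard could work is to parse the module and look for `append` calls on the library handles. This is a sketch of the idea only; the actual test file is not shown here, and `find_lib_appends` plus the `_lib` naming rule are assumptions made for illustration.

```python
import ast
from typing import List


def find_lib_appends(source: str) -> List[int]:
    """Return line numbers of `<name>_lib.append(...)` calls in source."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "append"
            and isinstance(node.func.value, ast.Name)
            and node.func.value.id.endswith("_lib")
        ):
            hits.append(node.lineno)
    return sorted(hits)
```

A regression test built on this would read builders/daily_append.py as text and assert that the returned list is empty. Ordinary `list.append` calls on other variables are ignored because the receiver name does not end in `_lib`.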

Follow-up: once this has been in place for 1-2 full Saturday cycles,
remove the defensive dedup in alpha-engine-predictor data/dataset.py
(`_load_ticker_parquet`). Track on ROADMAP under Data Platform.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cipher813 added a commit that referenced this pull request Apr 15, 2026
…rce) (#35)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>