Hard-fail daily_append + propagate exit code in weekday SSM by cipher813 · Pull Request #24 · cipher813/alpha-engine-data

cipher813 · 2026-04-14T14:45:48Z

Summary

ArcticDB universe library last wrote 2026-04-12. No writes on 4/13 or 4/14 despite the weekday Step Function's DailyData step reporting SUCCEEDED both days. Two silent-fail chains were masking this:

1. `daily_append.py` silently turned ArcticDB-wide failures into `status=ok`. Exceptions from `universe_lib.read(ticker)` were logged at `debug` and counted as `n_skip`, so if all 909 tickers failed, the function still returned success with `tickers_appended=0`. The same pattern applied to macro reads, per-ticker appends, and macro bar appends, and `_load_daily_closes` returned `{}` on S3 error.
2. The weekday SSM command ignored the Python exit code. The command was `python ... | tee ... ; echo EXIT_CODE=$?`: the trailing `echo` is the last command the shell runs, so SSM always saw exit 0. Without `pipefail`, the `python ... | tee ...` pipeline would report `tee`'s status rather than Python's in any case. Saturday's pipeline already uses `set -eo pipefail`; the weekday one was never updated. The weekday command also did not git-pull, so any post-Saturday change was invisible on EC2.
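The exit-code masking described in the second item can be reproduced locally. The following is a minimal sketch, assuming `bash` is on the PATH; `FAILING_JOB` is a stand-in for the real Python invocation, and `run_bash` is a hypothetical helper, not code from this repo.

```python
import subprocess


def run_bash(script: str) -> int:
    """Run a script under bash and return the shell's exit code."""
    return subprocess.run(["bash", "-c", script], capture_output=True).returncode


# Stand-in for the failing Python job (assumption: any command that exits 1).
FAILING_JOB = "sh -c 'exit 1'"

# Old weekday shape: the trailing echo is the last command, so the shell exits 0.
masked = run_bash(f"{FAILING_JOB} | tee /dev/null ; echo EXIT_CODE=$?")

# Hardened shape: pipefail makes the pipeline report the job's failure,
# and -e stops the script at that point.
propagated = run_bash(
    f"set -eo pipefail\n{FAILING_JOB} | tee /dev/null\necho 'not reached'"
)

print(f"masked={masked} propagated={propagated}")
```

The first script exits 0 even though the job failed; the second exits non-zero, which is the status the caller needs to see.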

Changes

- `builders/daily_append.py`
  - `_load_daily_closes` raises on missing file / zero rows.
  - Macro reads raise on SPY missing; warn on others.
  - Per-ticker read failures count `n_err` at warning level (was `n_skip` at debug).
  - Function raises `RuntimeError` if `n_ok == 0` or `n_err > 5%` of stock tickers.
- `weekly_collector.py`: `logger.exception` in `_run_daily` try/except blocks so tracebacks reach SSM logs. Status propagation to `main()` `SystemExit(1)` unchanged.
- `infrastructure/step_function_daily.json`: `set -eo pipefail` + git pull matching the Saturday step. Dropped the trailing `echo EXIT_CODE` line.
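The counting rules in the `builders/daily_append.py` changes can be sketched roughly as below. This is an illustration under stated assumptions, not the repo's code: `append_all`, `read_existing`, and `append_row` are hypothetical names standing in for the real function and the ArcticDB calls, and treating "today's row already exists" as the skip case is inferred from the later discussion of `n_skip`. PR #33, listed further down, later removes the `n_ok == 0` clause.

```python
import logging

logger = logging.getLogger("daily_append_sketch")

ERR_RATE_LIMIT = 0.05  # the 5% tolerance named in the PR description


def append_all(tickers, read_existing, append_row, today):
    """Append today's row per ticker and classify each outcome.

    read_existing(ticker) returns the last stored date; append_row(ticker, day)
    writes a row. Both are hypothetical stand-ins for the ArcticDB calls.
    """
    n_ok = n_skip = n_err = 0
    for ticker in tickers:
        try:
            last_date = read_existing(ticker)
        except Exception as exc:
            # Previously logged at debug and counted as a skip.
            logger.warning("read failed for %s: %s", ticker, exc)
            n_err += 1
            continue
        if last_date == today:
            n_skip += 1  # already written today, nothing to do
            continue
        try:
            append_row(ticker, today)
        except Exception as exc:
            logger.warning("append failed for %s: %s", ticker, exc)
            n_err += 1
            continue
        n_ok += 1

    logger.info("ArcticDB daily_append: n_ok=%d n_skip=%d n_err=%d", n_ok, n_skip, n_err)
    if n_ok == 0 or n_err > ERR_RATE_LIMIT * len(tickers):
        raise RuntimeError(f"daily_append failed: n_ok={n_ok} n_skip={n_skip} n_err={n_err}")
    return {"status": "ok", "tickers_appended": n_ok, "n_skip": n_skip, "n_err": n_err}
```

With this shape, a library-wide read failure lands entirely in `n_err` and raises, instead of returning `status=ok` with `tickers_appended=0`.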

Out of scope (tracked follow-ups)

PR 2: repo-wide silent-fail audit across rest of alpha-engine-data collectors. Expect similar `except ... log.debug` patterns in macro, universe_returns, alternative collectors.
PR 4: flow-doctor integration on the weekly_collector entrypoint so failures produce structured reports.
PR 3 (roadmap): pre-flight health check pattern — design TBD.

Test plan

`pytest tests/ --ignore=tests/integration -q` — 41 passed
`python3 -c 'import ast; ast.parse(...)'` on all three changed files
`python3 -c 'import json; json.load(...)'` on step_function_daily.json
Deploy Step Function update: `aws stepfunctions update-state-machine ...` (owner: Brian — manual)
Next weekday run (tomorrow 6:05 AM PT) — confirm DailyData step runs git pull, daily_append writes to ArcticDB, and the `ArcticDB daily_append: n_ok=N n_skip=S n_err=E` line appears in /var/log/daily-data.log
Confirm 4/15 entry appears in `s3://alpha-engine-research/arcticdb/universe.../tdata/`

Deployment

Merge → EC2 will git-pull on next DailyData run. Step Function JSON must be pushed with `aws stepfunctions update-state-machine` (not auto-deployed).

🤖 Generated with Claude Code

Root cause: ArcticDB universe library last wrote 2026-04-12. No writes on 4/13 or 4/14 even though the weekday Step Function's DailyData step reported SUCCEEDED both days. Two silent-fail chains were masking this:

1. `daily_append.py` swallowed `universe_lib.read(ticker)` exceptions at `log.debug` level and counted them as `n_skip`, so an ArcticDB-wide auth/URI failure would report `status=ok` with `tickers_appended=0`. Same pattern on macro-series reads, per-ticker appends, and macro bar appends. `_load_daily_closes` returned `{}` on any S3 error.
2. The weekday Step Function's DailyData SSM command was `python ... | tee ... ; echo EXIT_CODE=$?`. The final `echo` is what SSM sees, so the shell always exited 0 regardless of Python's exit code. Saturday's `alpha-engine-saturday-pipeline` already uses `set -eo pipefail`; the weekday one was never updated. Weekday also did not git-pull, so any post-Saturday change on EC2 was invisible.

Changes:

- `builders/daily_append.py`: `_load_daily_closes` raises on missing file / zero rows; macro reads raise on SPY missing and warn on others; per-ticker read failures count `n_err` (not `n_skip`) at warning level; the function raises `RuntimeError` if `n_ok == 0` or `n_err > 5%` of stock tickers. Returns structured result on success.
- `weekly_collector.py`: `_run_daily` now uses `logger.exception` so the traceback reaches SSM logs; status propagation to `main()`'s `SystemExit(1)` path unchanged.
- `infrastructure/step_function_daily.json`: `set -eo pipefail` + weekday git pull, matching the Saturday step. Dropped the `echo EXIT_CODE` line so the final command is actually the Python script.

Remaining silent-fail patterns across the rest of alpha-engine-data are deferred to a repo-wide audit PR (tracked).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Addresses inconsistency in the first commit: non-SPY macro series (VIX, VIX3M, TNX, IRX, GLD, USO) and sector ETF reads/writes were log.warning + continue, which is visible but doesn't halt. Per feedback_hard_fail_until_stable, every non-ok condition must exit non-zero while Alpha Engine is unstable. A missing VIX silently produces zero-valued regime-interaction features; a missing sector ETF silently corrupts features for every stock in that sector.

All macro + sector ETF read failures and append failures now raise RuntimeError. Per-ticker universe read/append keeps its 5% tolerance, since individual tickers can legitimately be new / not-yet-backfilled.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Follow-up to PR #24 (daily_append). Audits the rest of the alpha-engine-data daily production path for the same "except Exception: log.debug / pass" pattern that masked ArcticDB staleness for two days. Scope limited to files on the DailyData + feature-store write path; RAG ingestion, emailer, and fundamentals scan deferred to a later sweep.

features/compute.py

- `_load_daily_closes_delta`: per-date NoSuchKey upgraded to WARNING; every other S3 exception raises; raise if ALL dates in the business-day range were missing (the fingerprint of an upstream daily_closes outage).
- Per-ticker feature computation failure: log.debug → log.warning; new RuntimeError if `n_err / len(universe_tickers) > 5%`, matching daily_append.
- Empty store_rows now raises instead of returning `status=error` (status return is legacy; raising is consistent with hard-fail).
- `_load_cached_alternative` outer except: log.debug → log.warning so auth / network failures surface even when "no alt data" is expected.
- Schema version write: log.debug → log.warning with a comment explaining why this one stays non-raising (drift-check metadata, not feature data).

collectors/daily_closes.py

- head_object idempotency check: bare `except Exception: pass` → catch only ClientError with 404/NoSuchKey. Auth / throttling now raises instead of silently proceeding.
- Per-ticker yfinance extract: log.debug → log.warning so partial yfinance failures are visible in the daily log.

collectors/macro.py

- `load_from_s3`: pointer-missing still returns None (expected), but every other error now raises instead of masquerading as "no data."

weekly_collector.py

- S3 constituents load fallback: bare `except Exception: pass` → warning with context. The Wikipedia fallback remains (legitimate failover).
- Wikipedia constituents failure: bare `except Exception: pass` → ERROR log. The downstream `if not tickers` guard already hard-fails.

Out of scope (tracked for follow-up audit)

- rag/* (SEC / 8-K / earnings / theses ingestion)
- emailer.py `except Exception: pass` in finalize email
- features/compute.py fundamentals/alternative per-key fallback chains
- collectors/fundamentals.py per-ticker fetch log.debug

Dead code flagged for later removal

- features/reader.py: read_feature_snapshot / read_feature_range / latest_available_date / read_registry. No callers remain inside alpha-engine-data. Consumers in sibling repos (predictor / backtester / research) will migrate away as the ArcticDB cutover completes; safe to delete once the cross-repo migration is confirmed clean.

Tests: 41/41 pass. No new tests added: the hard-fail paths are essentially `if cond: raise` and are better exercised by the existing integration test suite against a live S3 bucket (follow-up).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
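The `_load_daily_closes_delta` rule described above (tolerate individual missing dates, propagate every other error, raise when the whole range is missing) might look something like the sketch below. It is an assumption-laden illustration: `load_closes_delta` and `business_days` are hypothetical names, `fetch(day)` stands in for the S3 read, and `KeyError` stands in for S3's NoSuchKey.

```python
import logging
from datetime import date, timedelta

logger = logging.getLogger("compute_sketch")


def business_days(start: date, end: date):
    """Yield Monday-Friday dates in [start, end]; market holidays are ignored here."""
    day = start
    while day <= end:
        if day.weekday() < 5:
            yield day
        day += timedelta(days=1)


def load_closes_delta(fetch, start: date, end: date) -> dict:
    """Load per-date closes, tolerating single gaps but not a total outage.

    fetch(day) is a hypothetical callable that returns that day's closes and
    raises KeyError when the object does not exist. Any other exception
    (auth, throttling, network) is deliberately not caught.
    """
    loaded, missing = {}, []
    for day in business_days(start, end):
        try:
            loaded[day] = fetch(day)
        except KeyError:
            logger.warning("daily_closes missing for %s", day)
            missing.append(day)
    if missing and not loaded:
        # Every date in the range was missing: treat as an upstream outage.
        raise RuntimeError(f"no daily_closes found for any of {len(missing)} business days")
    return loaded
```

Catching only the "not found" case keeps an auth or throttling error from being mistaken for an ordinary gap in the data.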

Hardened failures from PRs #24 + #25 now raise cleanly up through _run_daily and weekly_collector.main(). The pipeline exits non-zero, but there's no alerting: a 6 AM failure is only visible if someone manually reads CloudWatch / the SSM log. Flow-doctor closes that loop: any ERROR-level log (including traceback-emitting exceptions from logger.exception) fires email + a deduped GitHub issue.

Mirrors the executor's integration (alpha-engine/executor/log_config.py, alpha-engine/flow-doctor.yaml) verbatim:

- requirements.txt: flow-doctor[diagnosis]>=0.3.0,<0.4.0
- log_config.py: FlowDoctor singleton + FlowDoctorHandler at ERROR level, attached to the root logger when FLOW_DOCTOR_ENABLED=1. Import is inside _attach_flow_doctor so local dev without the dep installed still works.
- flow-doctor.yaml.example: committed template. Real file is gitignored and will be staged from alpha-engine-config at deploy time (same pattern as predictor.yaml, risk.yaml).

Out of scope for this PR (deployment steps, user action required)

- Add flow-doctor.yaml to alpha-engine-config repo at path matching what the Step Functions expect.
- Verify FLOW_DOCTOR_ENABLED=1 and FLOW_DOCTOR_GITHUB_TOKEN are already in /home/ec2-user/.alpha-engine.env (executor already uses these, so likely yes, but needs confirmation).
- First live fire: next daily or Saturday Step Function run. Expect an email + issue if any of the new hard-fail paths trigger.

Out of scope (tracked)

- Same integration for predictor Lambda (different deploy model: needs flow-doctor packaged into the Lambda image or layer).
- Same integration for research Lambda.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
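The attachment pattern described for log_config.py (an ERROR-level handler added only when FLOW_DOCTOR_ENABLED=1, with the optional dependency imported inside the function) can be sketched with the standard library alone. `AlertHandler` and `attach_alerting` are hypothetical stand-ins; flow-doctor's actual classes and constructor arguments are not shown in this PR and are not reproduced here.

```python
import logging
import os


class AlertHandler(logging.Handler):
    """Stand-in for FlowDoctorHandler: records what would be dispatched."""

    def __init__(self):
        super().__init__(level=logging.ERROR)  # WARNING and below are ignored
        self.dispatched = []

    def emit(self, record):
        # The default formatter appends the traceback when exc_info is set,
        # which is what logger.exception() provides.
        self.dispatched.append(self.format(record))


def attach_alerting(target=None):
    """Attach the alert handler when FLOW_DOCTOR_ENABLED=1; otherwise do nothing."""
    if os.environ.get("FLOW_DOCTOR_ENABLED") != "1":
        return None
    # In the real module the flow-doctor import would sit here, inside the
    # function, so importing log_config never requires the dependency.
    if target is None:
        target = logging.getLogger()  # root logger
    handler = AlertHandler()
    target.addHandler(handler)
    return handler
```

Because the handler filters at ERROR, the warnings introduced by the earlier hardening PRs stay in the log without triggering a dispatch, while anything logged through `logger.exception` does.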

Addresses the class of failure surfaced 2026-04-14: daily_append silently not writing to ArcticDB for two weekdays because arcticdb wasn't in the deploy image and the import error was swallowed. A preflight check catches this in ~1s instead of letting the pipeline run to "success" with stale data.

Pattern D (simplest): inline _preflight() in weekly_collector.py, called from main() after config load, before run_weekly(). No new files, no shared library. If the check pattern proves valuable across the other modules, we can extract a common helper later, but the per-repo checks are small enough (~30 LOC) that inlining is fine for now.

Scoped to external-world handshakes (env vars, S3, ArcticDB), NOT correctness of the collection itself. The hardened collectors from PRs #24 + #25 still own data-integrity hard-fails.

Checks by mode

- daily: AWS_REGION env, S3 bucket reachable, ArcticDB universe library readable, SPY freshness ≤ 4 days (covers Fri → Tue long weekend + buffer). 4-day stale SPY would have caught today's bug on 2026-04-14 instead of letting Friday's write look healthy until Saturday.
- phase1: AWS_REGION + FRED_API_KEY + POLYGON_API_KEY + S3 reachable.
- phase2: AWS_REGION + FMP_API_KEY + EDGAR_IDENTITY + S3 reachable.

Failures raise RuntimeError. main() already exits 1 on any SystemExit path, and flow-doctor (#26, once deployed) will dispatch the corresponding ERROR log as email + GitHub issue.

Out of scope (tracked)

- Same pattern in predictor inference + training, research Lambda, executor entrypoints, backtester. Rolling out after this first consumer proves the shape.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
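Two of the per-mode checks listed above, the required environment variables and the SPY freshness limit, are simple enough to sketch without AWS access. The function names below are hypothetical, and the S3 and ArcticDB reachability checks are omitted because they depend on client calls this description does not show.

```python
import os
from datetime import date

# Required env vars per mode, as listed in the description.
REQUIRED_ENV = {
    "daily": ["AWS_REGION"],
    "phase1": ["AWS_REGION", "FRED_API_KEY", "POLYGON_API_KEY"],
    "phase2": ["AWS_REGION", "FMP_API_KEY", "EDGAR_IDENTITY"],
}
MAX_SPY_AGE_DAYS = 4  # "SPY freshness <= 4 days"


def check_env(mode: str, env=None) -> None:
    """Raise if any env var required for this mode is unset or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_ENV[mode] if not env.get(name)]
    if missing:
        raise RuntimeError(f"preflight[{mode}]: missing env vars {missing}")


def check_spy_freshness(last_spy_date: date, today: date) -> None:
    """Raise if the newest SPY row is older than the allowed calendar-day age."""
    age_days = (today - last_spy_date).days
    if age_days > MAX_SPY_AGE_DAYS:
        raise RuntimeError(
            f"preflight[daily]: SPY last written {last_spy_date}, {age_days} days ago"
        )
```

A caller would run `check_env(mode)` for every mode and `check_spy_freshness(...)` only for `daily`, letting either `RuntimeError` propagate to the entrypoint.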

Applies the PR #24/#25/#28 pattern to the Saturday RAG ingestion path. The shell script was silently swallowing failures in all 5 pipelines, making "RAG Weekly Ingestion Complete" a lie whenever any step failed.

Shell script (rag/pipelines/run_weekly_ingestion.sh)

- Removed `|| echo "WARNING: ... (non-fatal)"` from all 5 ingestion steps, the CloudWatch heartbeat, and the completion email. set -e was already active but these swallowers defeated it.
- Removed the runtime `if [ -n "$FINNHUB_API_KEY" ]; then ... else echo SKIPPED fi` branch. All required env vars are now hard-failed by preflight (step 0) before any ingestion runs. A silently-skipped earnings transcript step defeats the purpose of having transcripts at all.
- Added `Step 0/5: python -m rag.preflight` at the top.
- The hardcoded 'status: ok' completion email is now truthful rather than aspirational: with set -e active and no swallowers, reaching the email means all 5 pipelines actually succeeded.

New file: rag/preflight.py

- RAGPreflight(BasePreflight) subclass: composes check_env_vars (AWS_REGION, VOYAGE_API_KEY, FINNHUB_API_KEY, EDGAR_IDENTITY, RAG_DATABASE_URL) + check_s3_bucket.
- main() uses alpha-engine-lib's setup_logging with the shared flow-doctor.yaml path, so a preflight failure fires email + issue via the existing dispatch.

rag/db.py

- is_available(): log.debug → log.warning for the exception path. The function was otherwise unchanged; it's a non-raising probe for future retrieval-side consumers. Flagged as unused inside alpha-engine-data (zero callers); defer deletion until cross-repo audit completes, since predictor / research / backtester may import from it.

rag/pipelines/ingest_8k_filings.py

- Per-URL download failure: log.debug → log.warning. Caller still treats None as "skip this filing" (aggregate counts are reported), so no behavior change; the failure rate is just visible now.

Dead code flagged (no change in this PR)

- rag/db.py::is_available: zero local callers. Keep for now, flag for future cross-repo sweep.

Out of scope (tracked)

- Adopt alpha-engine-lib setup_logging in each ingestion script's main() for consistent log formatting + flow-doctor capture of per-pipeline errors. Currently only preflight.py uses the lib; ingestion scripts still use Python's default root logger. Minor follow-up.
- Date-parsing `except ValueError: continue` patterns in ingest_sec_filings, ingest_8k_filings, ingest_theses, ingest_earnings_transcripts. Reviewed case-by-case; all are legitimate "skip this malformed entry" flows with aggregate counts upstream. Not silent fails.

Test plan

- [x] pytest tests/: 41 pass
- [x] Syntax check on all modified Python files
- [x] bash -n on run_weekly_ingestion.sh
- [ ] Next Saturday Step Function run exercises the hardened path. Forced failure test: unset FINNHUB_API_KEY on EC2 and re-run; must fail at preflight (step 0), not silently skip step 3.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
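Why a `|| echo ...` suffix defeats `set -e` can be checked with two tiny scripts. This is a sketch assuming `bash` is on the PATH; the script bodies are simplified stand-ins in which `false` plays the role of a failing ingestion step, not excerpts from run_weekly_ingestion.sh.

```python
import subprocess


def run_bash(script: str):
    """Run a script under bash; return (exit code, captured stdout)."""
    proc = subprocess.run(["bash", "-c", script], capture_output=True, text=True)
    return proc.returncode, proc.stdout


# A failing step followed by `|| echo` is not a failure as far as set -e is
# concerned, so the script carries on to the completion message.
SWALLOWED = """
set -e
false || echo "WARNING: ingestion failed (non-fatal)"
echo "RAG Weekly Ingestion Complete"
"""

# Without the swallower, set -e stops the script at the failing step.
HARDENED = """
set -e
false
echo "RAG Weekly Ingestion Complete"
"""

swallowed_rc, swallowed_out = run_bash(SWALLOWED)
hardened_rc, hardened_out = run_bash(HARDENED)
print(swallowed_rc, hardened_rc)
```

In the first script the completion line is printed and the exit code is 0; in the second the completion line is never reached, which is what makes a hardcoded "ok" at the end of the script trustworthy.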

2026-04-14 live run discovery: the n_ok==0 hard-fail guard added in PR #24 is a false positive on legitimate no-op reruns. When 900/902 tickers already have today's row in ArcticDB (because this morning's Step Function write succeeded), the loop correctly takes the "today already exists" skip path for each: n_ok=0, n_skip=900, n_err=2 (2 newly-listed tickers Q and SOLS not yet backfilled). My guard raised RuntimeError on that, treating "nothing to write because all done" as "failed to write anything."

The real silent-fail this guard was meant to catch (ArcticDB-wide auth/connectivity failure → every read throws) was reclassified to n_err (not n_skip) as part of PR #24. So the err_rate > 5% threshold already catches the true failure case, without false positives on no-op reruns or idempotent retries.

Kept: the err_rate > 5% threshold. If ArcticDB is genuinely broken, n_err will exceed 45 tickers on a 902-ticker run and this will fire.

Net behavior:

- n_ok=0, n_skip=902, n_err=0 → pass (idempotent rerun: everyone already wrote)
- n_ok=900, n_skip=0, n_err=2 → pass (normal run with 2 missing tickers)
- n_ok=0, n_skip=0, n_err=902 → fail (ArcticDB auth broken)
- n_ok=800, n_skip=0, n_err=102 → fail (err rate > 5%)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
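The net behavior above reduces to a single comparison. A minimal sketch follows, assuming each ticker lands in exactly one of the three counters so their sum can serve as the denominator; `passes_err_threshold` is a hypothetical name, not the function in daily_append.py.

```python
ERR_RATE_LIMIT = 0.05  # the retained 5% threshold


def passes_err_threshold(n_ok: int, n_skip: int, n_err: int) -> bool:
    """Return True when the error count is within 5% of all tickers seen.

    Assumption: n_ok + n_skip + n_err equals the number of stock tickers.
    n_ok == 0 alone is no longer a failure, so idempotent reruns pass.
    """
    total = n_ok + n_skip + n_err
    return n_err <= ERR_RATE_LIMIT * total


# The four cases from the description, plus the live run that exposed the bug.
cases = [
    (0, 902, 0),    # idempotent rerun
    (900, 0, 2),    # normal run with 2 missing tickers
    (0, 0, 902),    # ArcticDB auth broken
    (800, 0, 102),  # error rate above 5%
    (0, 900, 2),    # 2026-04-14 rerun that the old guard rejected
]
results = [passes_err_threshold(*case) for case in cases]
print(results)
```

On a 902-ticker run the limit works out to 45.1, so 45 errors pass and 46 fail, matching the "exceed 45 tickers" figure above.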

cipher813 and others added 3 commits April 14, 2026 07:45

Merge branch 'main' into fix/daily-append-hard-fail

2c1b121

cipher813 merged commit ba8244d into main Apr 14, 2026
1 check passed

cipher813 deleted the fix/daily-append-hard-fail branch April 14, 2026 14:54

cipher813 mentioned this pull request Apr 14, 2026, in:

- Convert silent fails in daily production path to hard fails #25 (Merged, 3 tasks)
- Integrate flow-doctor for failure alerting #26 (Merged, 5 tasks)
- Add preflight checks to weekly_collector entrypoint #27 (Closed, 4 tasks)
- RAG hardening — remove silent fails + add preflight #31 (Merged, 5 tasks)
- Remove over-eager n_ok==0 guard in daily_append #33 (Merged, 2 tasks)

cipher813 merged 3 commits into main from fix/daily-append-hard-fail