Hard-fail daily_append + propagate exit code in weekday SSM #24

Merged
cipher813 merged 3 commits into main from fix/daily-append-hard-fail
Apr 14, 2026

@cipher813
Owner

Summary

The ArcticDB universe library was last written on 2026-04-12. There were no writes on 4/13 or 4/14, even though the weekday Step Function's DailyData step reported SUCCEEDED on both days. Two silent-fail chains were masking this:

  1. daily_append.py silently turned ArcticDB-wide failures into `status=ok`. `universe_lib.read(ticker)` exceptions were logged at `debug` and counted as `n_skip`, so if all 909 tickers failed, the function still returned success with `tickers_appended=0`. The same pattern applied to macro reads, per-ticker appends, and macro bar appends, and `_load_daily_closes` returned `{}` on S3 error.
  2. Weekday SSM command ignored Python exit code. `python ... | tee ... ; echo EXIT_CODE=$?` — the `echo` is what SSM sees, so the shell always exited 0. Saturday's pipeline already uses `set -eo pipefail`; the weekday one was never updated. Weekday also did not git-pull, so any post-Saturday change on EC2 was invisible.
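
The exit-code masking in item 2 can be reproduced locally. This is a minimal sketch, assuming `bash` is on the PATH; `false` stands in for the failing Python job, and the command strings are illustrative rather than the actual SSM command.

```python
import subprocess

# Old shape: the pipeline's status is tee's (0), and the trailing echo is the
# last command to run, so the shell exits 0 even though the job failed.
masked = subprocess.run(
    ["bash", "-c", "false | tee /dev/null ; echo EXIT_CODE=$?"],
    capture_output=True,
    text=True,
)

# New shape: pipefail makes the pipeline report the failing command's status,
# and -e aborts the script at that point, so the caller sees a non-zero code.
strict = subprocess.run(
    ["bash", "-c", "set -eo pipefail; false | tee /dev/null"],
    capture_output=True,
    text=True,
)

print("masked:", masked.returncode, masked.stdout.strip())
print("strict:", strict.returncode)
```

Note that in the old shape `$?` reports the status of `tee`, not of the job, so even the echoed value is misleading.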

Changes

  • `builders/daily_append.py`
    • `_load_daily_closes` raises on missing file / zero rows.
    • Macro reads raise on SPY missing; warn on others.
    • Per-ticker read failures count `n_err` at warning level (was `n_skip` at debug).
    • Function raises `RuntimeError` if `n_ok == 0` or `n_err > 5%` of stock tickers.
  • `weekly_collector.py` — `logger.exception` in `_run_daily` try/except blocks so tracebacks reach SSM logs. Status propagation to `main()` `SystemExit(1)` unchanged.
  • `infrastructure/step_function_daily.json` — `set -eo pipefail` + git pull matching the Saturday step. Dropped the trailing `echo EXIT_CODE` line.
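
The per-ticker reclassification can be sketched as follows. This is not the code in `builders/daily_append.py`: `classify_reads`, `read_fn`, and `today_exists_fn` are hypothetical names chosen for illustration, and the append itself is omitted.

```python
import logging

logger = logging.getLogger("daily_append_sketch")


def classify_reads(tickers, read_fn, today_exists_fn):
    """Count per-ticker outcomes; a failed read is an error, not a skip."""
    n_ok = n_skip = n_err = 0
    for ticker in tickers:
        try:
            frame = read_fn(ticker)
        except Exception as exc:
            # Previously: debug-level log + n_skip, which hid library-wide failures.
            logger.warning("read failed for %s: %s", ticker, exc)
            n_err += 1
            continue
        if today_exists_fn(frame):
            n_skip += 1  # legitimate skip: today's row is already present
        else:
            n_ok += 1  # the real code would append today's bar here
    return n_ok, n_skip, n_err


def broken_read(ticker):
    """Simulates a library-wide auth failure: every read throws."""
    raise ConnectionError("auth failure")


counts = classify_reads(["AAA", "BBB"], broken_read, lambda frame: False)
print(counts)
```

With every read failing, the counts land entirely in `n_err`, which is what lets a threshold on the error rate see the outage.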

Out of scope (tracked follow-ups)

  • PR 2: repo-wide silent-fail audit across rest of alpha-engine-data collectors. Expect similar `except ... log.debug` patterns in macro, universe_returns, alternative collectors.
  • PR 4: flow-doctor integration on the weekly_collector entrypoint so failures produce structured reports.
  • PR 3 (roadmap): pre-flight health check pattern — design TBD.

Test plan

  • `pytest tests/ --ignore=tests/integration -q` — 41 passed
  • `python3 -c 'import ast; ast.parse(...)'` on all three changed files
  • `python3 -c 'import json; json.load(...)'` on step_function_daily.json
  • Deploy Step Function update: `aws stepfunctions update-state-machine ...` (owner: Brian — manual)
  • Next weekday run (tomorrow 6:05 AM PT) — confirm DailyData step runs git pull, daily_append writes to ArcticDB, and the `ArcticDB daily_append: n_ok=N n_skip=S n_err=E` line appears in /var/log/daily-data.log
  • Confirm 4/15 entry appears in `s3://alpha-engine-research/arcticdb/universe.../tdata/`

Deployment

Merge → EC2 will git-pull on next DailyData run. Step Function JSON must be pushed with `aws stepfunctions update-state-machine` (not auto-deployed).

🤖 Generated with Claude Code

cipher813 and others added 3 commits April 14, 2026 07:45
Root cause: ArcticDB universe library last wrote 2026-04-12. No writes on
4/13 or 4/14 even though the weekday Step Function's DailyData step
reported SUCCEEDED both days.

Two silent-fail chains were masking this:

1. `daily_append.py` swallowed `universe_lib.read(ticker)` exceptions at
   `log.debug` level and counted them as `n_skip`, so an ArcticDB-wide
   auth/URI failure would report `status=ok` with `tickers_appended=0`.
   Same pattern on macro-series reads, per-ticker appends, and macro
   bar appends. `_load_daily_closes` returned `{}` on any S3 error.

2. The weekday Step Function's DailyData SSM command was
   `python ... | tee ... ; echo EXIT_CODE=$?`. The final `echo` is what
   SSM sees, so the shell always exited 0 regardless of Python's exit
   code. Saturday's `alpha-engine-saturday-pipeline` already uses
   `set -eo pipefail`; the weekday one was never updated. Weekday also
   did not git-pull, so any post-Saturday change on EC2 was invisible.

Changes:
- `builders/daily_append.py`: `_load_daily_closes` raises on missing
  file / zero rows; macro reads raise on SPY missing and warn on others;
  per-ticker read failures count `n_err` (not `n_skip`) at warning level;
  the function raises `RuntimeError` if `n_ok == 0` or `n_err > 5%` of
  stock tickers. Returns structured result on success.
- `weekly_collector.py`: `_run_daily` now uses `logger.exception` so the
  traceback reaches SSM logs; status propagation to `main()`'s
  `SystemExit(1)` path unchanged.
- `infrastructure/step_function_daily.json`: `set -eo pipefail` + weekday
  git pull, matching the Saturday step. Dropped the `echo EXIT_CODE` line
  so the final command is actually the Python script.

Remaining silent-fail patterns across the rest of alpha-engine-data are
deferred to a repo-wide audit PR (tracked).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Addresses inconsistency in the first commit: non-SPY macro series
(VIX, VIX3M, TNX, IRX, GLD, USO) and sector ETF reads/writes were
log.warning + continue, which is visible but doesn't halt. Per
feedback_hard_fail_until_stable, every non-ok condition must exit
non-zero while Alpha Engine is unstable. A missing VIX silently
produces zero-valued regime-interaction features; a missing sector ETF
silently corrupts features for every stock in that sector.

All macro + sector ETF read failures and append failures now raise
RuntimeError. Per-ticker universe read/append keeps its 5% tolerance —
individual tickers can legitimately be new / not-yet-backfilled.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@cipher813 cipher813 merged commit ba8244d into main Apr 14, 2026
1 check passed
@cipher813 cipher813 deleted the fix/daily-append-hard-fail branch April 14, 2026 14:54
cipher813 added a commit that referenced this pull request Apr 14, 2026
Follow-up to PR #24 (daily_append). Audits the rest of the alpha-engine-data
daily production path for the same "except Exception: log.debug / pass"
pattern that masked ArcticDB staleness for two days. Scope limited to
files on the DailyData + feature-store write path; RAG ingestion,
emailer, and fundamentals scan deferred to a later sweep.

features/compute.py
- `_load_daily_closes_delta`: per-date NoSuchKey upgraded to WARNING;
  every other S3 exception raises; raise if ALL dates in the business-day
  range were missing (the fingerprint of an upstream daily_closes outage).
- Per-ticker feature computation failure: log.debug → log.warning;
  new RuntimeError if `n_err / len(universe_tickers) > 5%`, matching
  daily_append.
- Empty store_rows now raises instead of returning `status=error`
  (status return is legacy; raising is consistent with hard-fail).
- `_load_cached_alternative` outer except: log.debug → log.warning so
  auth / network failures surface even when "no alt data" is expected.
- Schema version write: log.debug → log.warning with a comment
  explaining why this one stays non-raising (drift-check metadata, not
  feature data).
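
A rough sketch of the "warn per missing date, raise if every date is missing" rule described for `_load_daily_closes_delta`. The helper names and the `load_day` callback are assumptions for illustration; the real function reads from S3, and this version ignores market holidays.

```python
import logging
from datetime import date, timedelta

logger = logging.getLogger("compute_sketch")


def business_days(start, end):
    """Yield Mon-Fri dates in [start, end]; market holidays are not handled."""
    day = start
    while day <= end:
        if day.weekday() < 5:
            yield day
        day += timedelta(days=1)


def load_closes_delta(start, end, load_day):
    """load_day(d) returns a dict of closes, or None when the object is missing.

    Any other failure is expected to propagate as an exception from load_day.
    """
    loaded, missing = {}, []
    for day in business_days(start, end):
        closes = load_day(day)
        if closes is None:
            logger.warning("daily_closes missing for %s", day)
            missing.append(day)
        else:
            loaded[day] = closes
    if missing and not loaded:
        # Every date missing is the fingerprint of an upstream outage.
        raise RuntimeError(f"daily_closes missing for all {len(missing)} business days")
    return loaded


# One missing day out of two is tolerated (with a warning).
partial = load_closes_delta(
    date(2026, 4, 13),
    date(2026, 4, 14),
    lambda d: {"SPY": 1.0} if d == date(2026, 4, 14) else None,
)
print(sorted(partial))
```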

collectors/daily_closes.py
- head_object idempotency check: bare `except Exception: pass` → catch
  only ClientError with 404/NoSuchKey. Auth / throttling now raises
  instead of silently proceeding.
- Per-ticker yfinance extract: log.debug → log.warning so partial
  yfinance failures are visible in the daily log.

collectors/macro.py
- `load_from_s3`: pointer-missing still returns None (expected), but
  every other error now raises instead of masquerading as "no data."

weekly_collector.py
- S3 constituents load fallback: bare `except Exception: pass` → warning
  with context. The Wikipedia fallback remains (legitimate failover).
- Wikipedia constituents failure: bare `except Exception: pass` →
  ERROR log. The downstream `if not tickers` guard already hard-fails.

Out of scope (tracked for follow-up audit)
- rag/* (SEC / 8-K / earnings / theses ingestion)
- emailer.py `except Exception: pass` in finalize email
- features/compute.py fundamentals/alternative per-key fallback chains
- collectors/fundamentals.py per-ticker fetch log.debug

Dead code flagged for later removal
- features/reader.py — read_feature_snapshot / read_feature_range /
  latest_available_date / read_registry. No callers remain inside
  alpha-engine-data. Consumers in sibling repos (predictor / backtester
  / research) will migrate away as the ArcticDB cutover completes; safe
  to delete once the cross-repo migration is confirmed clean.

Tests: 41/41 pass. No new tests added — the hard-fail paths are
essentially `if cond: raise` and are better exercised by the existing
integration test suite against a live S3 bucket (follow-up).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cipher813 added a commit that referenced this pull request Apr 14, 2026
Hardened failures from PRs #24 + #25 now raise cleanly up through
_run_daily and weekly_collector.main(). The pipeline exits non-zero,
but there's no alerting — a 6 AM failure is only visible if someone
manually reads CloudWatch / the SSM log. Flow-doctor closes that loop:
any ERROR-level log (including traceback-emitting exceptions from
logger.exception) fires email + a deduped GitHub issue.

Mirrors the executor's integration (alpha-engine/executor/log_config.py,
alpha-engine/flow-doctor.yaml) verbatim:

- requirements.txt: flow-doctor[diagnosis]>=0.3.0,<0.4.0
- log_config.py: FlowDoctor singleton + FlowDoctorHandler at ERROR level,
  attached to the root logger when FLOW_DOCTOR_ENABLED=1. Import is
  inside _attach_flow_doctor so local dev without the dep installed
  still works.
- flow-doctor.yaml.example: committed template. Real file is gitignored
  — will be staged from alpha-engine-config at deploy time (same pattern
  as predictor.yaml, risk.yaml).
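
The attach-on-flag shape of `log_config.py` might look roughly like the sketch below. It deliberately does not use flow-doctor's API, which is not shown in this PR: `CollectingHandler` is a hypothetical stand-in for `FlowDoctorHandler`, and `attach_alert_handler` is an illustrative name. Only the `FLOW_DOCTOR_ENABLED=1` gate and the ERROR level come from the description above.

```python
import logging
import os


class CollectingHandler(logging.Handler):
    """Hypothetical stand-in for the alerting handler; it only stores records."""

    def __init__(self):
        super().__init__(level=logging.ERROR)
        self.records = []

    def emit(self, record):
        self.records.append(record)


def attach_alert_handler(logger, env=os.environ):
    """Attach the handler only when FLOW_DOCTOR_ENABLED=1; return it or None."""
    if env.get("FLOW_DOCTOR_ENABLED") != "1":
        return None
    # The real integration imports flow-doctor here, inside the function,
    # so a missing dependency only matters when the feature is switched on.
    handler = CollectingHandler()
    logger.addHandler(handler)
    return handler


log = logging.getLogger("alert_sketch")
handler = attach_alert_handler(log, {"FLOW_DOCTOR_ENABLED": "1"})
log.warning("below ERROR, so the handler ignores it")
try:
    raise RuntimeError("boom")
except RuntimeError:
    log.exception("daily run failed")  # ERROR level, traceback attached

print(len(handler.records))
```

Because `logger.exception` logs at ERROR with `exc_info` set, the hard-fail paths from the earlier PRs reach the handler without any extra wiring.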

Out of scope for this PR (deployment steps — user action required)
- Add flow-doctor.yaml to alpha-engine-config repo at path matching what
  the Step Functions expect.
- Verify FLOW_DOCTOR_ENABLED=1 and FLOW_DOCTOR_GITHUB_TOKEN are already
  in /home/ec2-user/.alpha-engine.env (executor already uses these, so
  likely yes, but needs confirmation).
- First live fire: next daily or Saturday Step Function run — expect an
  email + issue if any of the new hard-fail paths trigger.

Out of scope (tracked)
- Same integration for predictor Lambda (different deploy model — needs
  flow-doctor packaged into the Lambda image or layer).
- Same integration for research Lambda.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cipher813 added a commit that referenced this pull request Apr 14, 2026
Addresses the class of failure surfaced 2026-04-14: daily_append
silently not writing to ArcticDB for two weekdays because arcticdb
wasn't in the deploy image and the import error was swallowed. A
preflight check catches this in ~1s instead of letting the pipeline
run to "success" with stale data.

Pattern D (simplest): inline _preflight() in weekly_collector.py,
called from main() after config load, before run_weekly(). No new
files, no shared library. If the check pattern proves valuable across
the other modules, we can extract a common helper later — but the
per-repo checks are small enough (~30 LOC) that inlining is fine for
now.

Scoped to external-world handshakes (env vars, S3, ArcticDB) — NOT
correctness of the collection itself. The hardened collectors from
PRs #24 + #25 still own data-integrity hard-fails.

Checks by mode
- daily: AWS_REGION env, S3 bucket reachable, ArcticDB universe library
  readable, SPY freshness ≤ 4 days (covers Fri → Tue long weekend +
  buffer). 4-day stale SPY would have caught today's bug on 2026-04-14
  instead of letting Friday's write look healthy until Saturday.
- phase1: AWS_REGION + FRED_API_KEY + POLYGON_API_KEY + S3 reachable.
- phase2: AWS_REGION + FMP_API_KEY + EDGAR_IDENTITY + S3 reachable.

Failures raise RuntimeError. main() already exits 1 on any SystemExit
path, and flow-doctor (#26, once deployed) will dispatch the
corresponding ERROR log as email + GitHub issue.
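
A minimal sketch of the env-var and freshness parts of such a preflight. The env-var names per mode and the 4-day limit are taken from the list above; the function names are illustrative, and the S3 and ArcticDB reachability checks are omitted because they need live services.

```python
import os
from datetime import date

REQUIRED_ENV = {
    "daily": ["AWS_REGION"],
    "phase1": ["AWS_REGION", "FRED_API_KEY", "POLYGON_API_KEY"],
    "phase2": ["AWS_REGION", "FMP_API_KEY", "EDGAR_IDENTITY"],
}
MAX_SPY_AGE_DAYS = 4


def check_env(mode, env=os.environ):
    """Raise if any env var required for this mode is unset or empty."""
    missing = [name for name in REQUIRED_ENV[mode] if not env.get(name)]
    if missing:
        raise RuntimeError(f"preflight[{mode}]: missing env vars {missing}")


def check_freshness(last_write, today, max_age_days=MAX_SPY_AGE_DAYS):
    """Raise if the last SPY write is older than max_age_days calendar days."""
    age = (today - last_write).days
    if age > max_age_days:
        raise RuntimeError(f"preflight: SPY is {age} days stale (limit {max_age_days})")
    return age


check_env("daily", {"AWS_REGION": "us-east-1"})
age = check_freshness(date(2026, 4, 10), date(2026, 4, 14))
print(age)
```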

Out of scope (tracked)
- Same pattern in predictor inference + training, research Lambda,
  executor entrypoints, backtester. Rolling out after this first
  consumer proves the shape.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cipher813 added a commit that referenced this pull request Apr 14, 2026
Applies the PR #24/#25/#28 pattern to the Saturday RAG ingestion path.
The shell script was silently swallowing failures in all 5 pipelines,
making "RAG Weekly Ingestion Complete" a lie whenever any step failed.

Shell script (rag/pipelines/run_weekly_ingestion.sh)
- Removed `|| echo "WARNING: ... (non-fatal)"` from all 5 ingestion
  steps, the CloudWatch heartbeat, and the completion email. set -e
  was already active but these swallowers defeated it.
- Removed the runtime `if [ -n "$FINNHUB_API_KEY" ]; then ... else
  echo SKIPPED fi` branch. All required env vars are now hard-failed
  by preflight (step 0) before any ingestion runs. A silently-skipped
  earnings transcript step defeats the purpose of having transcripts
  at all.
- Added `Step 0/5: python -m rag.preflight` at the top.
- The hardcoded 'status: ok' completion email is now truthful rather
  than aspirational — with set -e active and no swallowers, reaching
  the email means all 5 pipelines actually succeeded.

New file: rag/preflight.py
- RAGPreflight(BasePreflight) subclass — composes check_env_vars
  (AWS_REGION, VOYAGE_API_KEY, FINNHUB_API_KEY, EDGAR_IDENTITY,
  RAG_DATABASE_URL) + check_s3_bucket.
- main() uses alpha-engine-lib's setup_logging with the shared
  flow-doctor.yaml path, so a preflight failure fires email + issue
  via the existing dispatch.

rag/db.py
- is_available(): log.debug → log.warning for the exception path.
  The function was otherwise unchanged — it's a non-raising probe for
  future retrieval-side consumers. Flagged as unused inside
  alpha-engine-data (zero callers); defer deletion until cross-repo
  audit completes, since predictor / research / backtester may import
  from it.

rag/pipelines/ingest_8k_filings.py
- Per-URL download failure: log.debug → log.warning. Caller still
  treats None as "skip this filing" (aggregate counts are reported),
  so no behavior change; the failure rate is just visible now.

Dead code flagged (no change in this PR)
- rag/db.py::is_available — zero local callers. Keep for now, flag
  for future cross-repo sweep.

Out of scope (tracked)
- Adopt alpha-engine-lib setup_logging in each ingestion script's
  main() for consistent log formatting + flow-doctor capture of
  per-pipeline errors. Currently only preflight.py uses the lib;
  ingestion scripts still use Python's default root logger. Minor
  follow-up.
- Date-parsing `except ValueError: continue` patterns in
  ingest_sec_filings, ingest_8k_filings, ingest_theses,
  ingest_earnings_transcripts. Reviewed case-by-case — all are
  legitimate "skip this malformed entry" flows with aggregate counts
  upstream. Not silent fails.

Test plan
- [x] pytest tests/ — 41 pass
- [x] Syntax check on all modified Python files
- [x] bash -n on run_weekly_ingestion.sh
- [ ] Next Saturday Step Function run exercises the hardened path.
  Forced failure test: unset FINNHUB_API_KEY on EC2 and re-run —
  must fail at preflight (step 0), not silently skip step 3.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cipher813 added a commit that referenced this pull request Apr 14, 2026
2026-04-14 live run discovery: the n_ok==0 hard-fail guard added in
PR #24 is a false positive on legitimate no-op reruns. When 900/902
tickers already have today's row in ArcticDB (because this morning's
Step Function write succeeded), the loop correctly takes the "today
already exists" skip path for each — n_ok=0, n_skip=900, n_err=2
(2 newly-listed tickers Q and SOLS not yet backfilled). My guard
raised RuntimeError on that, treating "nothing to write because all
done" as "failed to write anything."

The real silent-fail this guard was meant to catch (ArcticDB-wide
auth/connectivity failure → every read throws) was reclassified to
n_err (not n_skip) as part of PR #24. So the err_rate > 5% threshold
already catches the true failure case, without false positives on
no-op reruns or idempotent retries.

Kept: the err_rate > 5% threshold. If ArcticDB is genuinely broken,
n_err will exceed 45 tickers on a 902-ticker run and this will fire.

Net behavior:
- n_ok=0, n_skip=902, n_err=0  → pass (idempotent rerun — everyone already wrote)
- n_ok=900, n_skip=0, n_err=2  → pass (normal run with 2 missing tickers)
- n_ok=0, n_skip=0, n_err=902  → fail (ArcticDB auth broken)
- n_ok=800, n_skip=0, n_err=102 → fail (err rate > 5%)
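
The four cases above can be checked against a small sketch of the revised guard. `check_append_counts` is a hypothetical name, and the denominator (the sum of the three counts) is an assumption; the actual code may divide by the universe size instead.

```python
MAX_ERR_RATE = 0.05


def check_append_counts(n_ok, n_skip, n_err, max_err_rate=MAX_ERR_RATE):
    """Raise if the error rate exceeds the threshold; n_ok == 0 alone is fine."""
    total = n_ok + n_skip + n_err
    # Assumption: an empty run is left to other checks rather than failed here.
    err_rate = n_err / total if total else 0.0
    if err_rate > max_err_rate:
        raise RuntimeError(
            f"daily_append: n_ok={n_ok} n_skip={n_skip} n_err={n_err} "
            f"(err_rate={err_rate:.1%} > {max_err_rate:.0%})"
        )
    return err_rate


def passes(n_ok, n_skip, n_err):
    """Convenience wrapper: True if the guard does not raise."""
    try:
        check_append_counts(n_ok, n_skip, n_err)
    except RuntimeError:
        return False
    return True


results = [
    passes(0, 902, 0),    # idempotent rerun
    passes(900, 0, 2),    # normal run with 2 missing tickers
    passes(0, 0, 902),    # ArcticDB auth broken
    passes(800, 0, 102),  # error rate above 5%
]
print(results)
```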

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>