Skip to content

Convert silent fails in daily production path to hard fails#25

Merged
cipher813 merged 1 commit into
mainfrom
fix/silent-fail-audit
Apr 14, 2026
Merged

Convert silent fails in daily production path to hard fails#25
cipher813 merged 1 commit into
mainfrom
fix/silent-fail-audit

Conversation

@cipher813
Copy link
Copy Markdown
Owner

Summary

Follow-up to #24. Audits the rest of the alpha-engine-data daily production path for the `except Exception: log.debug / pass` pattern that masked ArcticDB staleness for two days. Scope limited to files on the DailyData + feature-store write path; RAG, emailer, and fundamentals scan deferred.

Changes

`features/compute.py`

  • `_load_daily_closes_delta`: NoSuchKey → WARNING; every other S3 exception raises; raise if all dates in the business-day range are missing.
  • Per-ticker feature compute failure: `log.debug` → `log.warning`; RuntimeError if `n_err / len(universe_tickers) > 5%` (matches daily_append).
  • Empty `store_rows` raises instead of `return {status: error}`.
  • `_load_cached_alternative` outer except: `log.debug` → `log.warning`.
  • Schema version write: upgraded to WARNING, kept non-raising (drift metadata, not features).

`collectors/daily_closes.py`

  • head_object idempotency check: bare except → only `ClientError` with 404/NoSuchKey. Auth / throttling now raises.
  • Per-ticker yfinance extract: `log.debug` → `log.warning`.

`collectors/macro.py`

  • `load_from_s3`: pointer-missing returns None; other errors raise.

`weekly_collector.py`

  • S3 constituents load + Wikipedia fallback: bare `except pass` → WARNING + ERROR with context (fallback chain preserved).

Dead code flagged (defer to cleanup PR)

  • `features/reader.py` — `read_feature_snapshot / read_feature_range / latest_available_date / read_registry`. Zero callers in alpha-engine-data. Cross-repo consumers (predictor, backtester, research) will migrate away as ArcticDB cutover lands. Safe to delete once migration is confirmed.

Out of scope (tracked for follow-up audit)

  • `rag/*` (SEC / 8-K / earnings / theses ingestion)
  • `emailer.py` `except Exception: pass` in finalize email
  • `features/compute.py` fundamentals/alternative per-key fallback chains
  • `collectors/fundamentals.py` per-ticker fetch `log.debug`

Test plan

  • `pytest tests/ --ignore=tests/integration -q` — 41 passed
  • Saturday weekly Step Function run (first exercise of hardened feature compute path on full universe)
  • Next weekday DailyData run (exercises hardened `_load_daily_closes_delta`)

🤖 Generated with Claude Code

Follow-up to PR #24 (daily_append). Audits the rest of the alpha-engine-data
daily production path for the same "except Exception: log.debug / pass"
pattern that masked ArcticDB staleness for two days. Scope limited to
files on the DailyData + feature-store write path; RAG ingestion,
emailer, and fundamentals scan deferred to a later sweep.

features/compute.py
- `_load_daily_closes_delta`: per-date NoSuchKey upgraded to WARNING;
  every other S3 exception raises; raise if ALL dates in the business-day
  range were missing (the fingerprint of an upstream daily_closes outage).
- Per-ticker feature computation failure: log.debug → log.warning;
  new RuntimeError if `n_err / len(universe_tickers) > 5%`, matching
  daily_append.
- Empty store_rows now raises instead of returning `status=error`
  (status return is legacy; raising is consistent with hard-fail).
- `_load_cached_alternative` outer except: log.debug → log.warning so
  auth / network failures surface even when "no alt data" is expected.
- Schema version write: log.debug → log.warning with a comment
  explaining why this one stays non-raising (drift-check metadata, not
  feature data).

collectors/daily_closes.py
- head_object idempotency check: bare `except Exception: pass` → catch
  only ClientError with 404/NoSuchKey. Auth / throttling now raises
  instead of silently proceeding.
- Per-ticker yfinance extract: log.debug → log.warning so partial
  yfinance failures are visible in the daily log.

collectors/macro.py
- `load_from_s3`: pointer-missing still returns None (expected), but
  every other error now raises instead of masquerading as "no data."

weekly_collector.py
- S3 constituents load fallback: bare `except Exception: pass` → warning
  with context. The Wikipedia fallback remains (legitimate failover).
- Wikipedia constituents failure: bare `except Exception: pass` →
  ERROR log. The downstream `if not tickers` guard already hard-fails.

Out of scope (tracked for follow-up audit)
- rag/* (SEC / 8-K / earnings / theses ingestion)
- emailer.py `except Exception: pass` in finalize email
- features/compute.py fundamentals/alternative per-key fallback chains
- collectors/fundamentals.py per-ticker fetch log.debug

Dead code flagged for later removal
- features/reader.py — read_feature_snapshot / read_feature_range /
  latest_available_date / read_registry. No callers remain inside
  alpha-engine-data. Consumers in sibling repos (predictor / backtester
  / research) will migrate away as the ArcticDB cutover completes; safe
  to delete once the cross-repo migration is confirmed clean.

Tests: 41/41 pass. No new tests added — the hard-fail paths are
essentially `if cond: raise` and are better exercised by the existing
integration test suite against a live S3 bucket (follow-up).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@cipher813 cipher813 merged commit b11379a into main Apr 14, 2026
1 check passed
@cipher813 cipher813 deleted the fix/silent-fail-audit branch April 14, 2026 15:06
cipher813 added a commit that referenced this pull request Apr 14, 2026
Hardened failures from PRs #24 + #25 now raise cleanly up through
_run_daily and weekly_collector.main(). The pipeline exits non-zero,
but there's no alerting — a 6 AM failure is only visible if someone
manually reads CloudWatch / the SSM log. Flow-doctor closes that loop:
any ERROR-level log (including traceback-emitting exceptions from
logger.exception) fires email + a deduped GitHub issue.

Mirrors the executor's integration (alpha-engine/executor/log_config.py,
alpha-engine/flow-doctor.yaml) verbatim:

- requirements.txt: flow-doctor[diagnosis]>=0.3.0,<0.4.0
- log_config.py: FlowDoctor singleton + FlowDoctorHandler at ERROR level,
  attached to the root logger when FLOW_DOCTOR_ENABLED=1. Import is
  inside _attach_flow_doctor so local dev without the dep installed
  still works.
- flow-doctor.yaml.example: committed template. Real file is gitignored
  — will be staged from alpha-engine-config at deploy time (same pattern
  as predictor.yaml, risk.yaml).

Out of scope for this PR (deployment steps — user action required)
- Add flow-doctor.yaml to alpha-engine-config repo at path matching what
  the Step Functions expect.
- Verify FLOW_DOCTOR_ENABLED=1 and FLOW_DOCTOR_GITHUB_TOKEN are already
  in /home/ec2-user/.alpha-engine.env (executor already uses these, so
  likely yes, but needs confirmation).
- First live fire: next daily or Saturday Step Function run — expect an
  email + issue if any of the new hard-fail paths trigger.

Out of scope (tracked)
- Same integration for predictor Lambda (different deploy model — needs
  flow-doctor packaged into the Lambda image or layer).
- Same integration for research Lambda.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cipher813 added a commit that referenced this pull request Apr 14, 2026
Hardened failures from PRs #24 + #25 now raise cleanly up through
_run_daily and weekly_collector.main(). The pipeline exits non-zero,
but there's no alerting — a 6 AM failure is only visible if someone
manually reads CloudWatch / the SSM log. Flow-doctor closes that loop:
any ERROR-level log (including traceback-emitting exceptions from
logger.exception) fires email + a deduped GitHub issue.

Mirrors the executor's integration (alpha-engine/executor/log_config.py,
alpha-engine/flow-doctor.yaml) verbatim:

- requirements.txt: flow-doctor[diagnosis]>=0.3.0,<0.4.0
- log_config.py: FlowDoctor singleton + FlowDoctorHandler at ERROR level,
  attached to the root logger when FLOW_DOCTOR_ENABLED=1. Import is
  inside _attach_flow_doctor so local dev without the dep installed
  still works.
- flow-doctor.yaml.example: committed template. Real file is gitignored
  — will be staged from alpha-engine-config at deploy time (same pattern
  as predictor.yaml, risk.yaml).

Out of scope for this PR (deployment steps — user action required)
- Add flow-doctor.yaml to alpha-engine-config repo at path matching what
  the Step Functions expect.
- Verify FLOW_DOCTOR_ENABLED=1 and FLOW_DOCTOR_GITHUB_TOKEN are already
  in /home/ec2-user/.alpha-engine.env (executor already uses these, so
  likely yes, but needs confirmation).
- First live fire: next daily or Saturday Step Function run — expect an
  email + issue if any of the new hard-fail paths trigger.

Out of scope (tracked)
- Same integration for predictor Lambda (different deploy model — needs
  flow-doctor packaged into the Lambda image or layer).
- Same integration for research Lambda.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cipher813 added a commit that referenced this pull request Apr 14, 2026
Addresses the class of failure surfaced 2026-04-14: daily_append
silently not writing to ArcticDB for two weekdays because arcticdb
wasn't in the deploy image and the import error was swallowed. A
preflight check catches this in ~1s instead of letting the pipeline
run to "success" with stale data.

Pattern D (simplest): inline _preflight() in weekly_collector.py,
called from main() after config load, before run_weekly(). No new
files, no shared library. If the check pattern proves valuable across
the other modules, we can extract a common helper later — but the
per-repo checks are small enough (~30 LOC) that inlining is fine for
now.

Scoped to external-world handshakes (env vars, S3, ArcticDB) — NOT
correctness of the collection itself. The hardened collectors from
PRs #24 + #25 still own data-integrity hard-fails.

Checks by mode
- daily: AWS_REGION env, S3 bucket reachable, ArcticDB universe library
  readable, SPY freshness ≤ 4 days (covers Fri → Tue long weekend +
  buffer). 4-day stale SPY would have caught today's bug on 2026-04-14
  instead of letting Friday's write look healthy until Saturday.
- phase1: AWS_REGION + FRED_API_KEY + POLYGON_API_KEY + S3 reachable.
- phase2: AWS_REGION + FMP_API_KEY + EDGAR_IDENTITY + S3 reachable.

Failures raise RuntimeError. main() already exits 1 on any SystemExit
path, and flow-doctor (#26, once deployed) will dispatch the
corresponding ERROR log as email + GitHub issue.

Out of scope (tracked)
- Same pattern in predictor inference + training, research Lambda,
  executor entrypoints, backtester. Rolling out after this first
  consumer proves the shape.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cipher813 added a commit that referenced this pull request Apr 14, 2026
Applies the PR #24/#25/#28 pattern to the Saturday RAG ingestion path.
The shell script was silently swallowing failures in all 5 pipelines,
making "RAG Weekly Ingestion Complete" a lie whenever any step failed.

Shell script (rag/pipelines/run_weekly_ingestion.sh)
- Removed `|| echo "WARNING: ... (non-fatal)"` from all 5 ingestion
  steps, the CloudWatch heartbeat, and the completion email. set -e
  was already active but these swallowers defeated it.
- Removed the runtime `if [ -n "$FINNHUB_API_KEY" ]; then ... else
  echo SKIPPED fi` branch. All required env vars are now hard-failed
  by preflight (step 0) before any ingestion runs. A silently-skipped
  earnings transcript step defeats the purpose of having transcripts
  at all.
- Added `Step 0/5: python -m rag.preflight` at the top.
- The hardcoded 'status: ok' completion email is now truthful rather
  than aspirational — with set -e active and no swallowers, reaching
  the email means all 5 pipelines actually succeeded.

New file: rag/preflight.py
- RAGPreflight(BasePreflight) subclass — composes check_env_vars
  (AWS_REGION, VOYAGE_API_KEY, FINNHUB_API_KEY, EDGAR_IDENTITY,
  RAG_DATABASE_URL) + check_s3_bucket.
- main() uses alpha-engine-lib's setup_logging with the shared
  flow-doctor.yaml path, so a preflight failure fires email + issue
  via the existing dispatch.

rag/db.py
- is_available(): log.debug → log.warning for the exception path.
  The function was otherwise unchanged — it's a non-raising probe for
  future retrieval-side consumers. Flagged as unused inside
  alpha-engine-data (zero callers); defer deletion until cross-repo
  audit completes, since predictor / research / backtester may import
  from it.

rag/pipelines/ingest_8k_filings.py
- Per-URL download failure: log.debug → log.warning. Caller still
  treats None as "skip this filing" (aggregate counts are reported),
  so no behavior change; the failure rate is just visible now.

Dead code flagged (no change in this PR)
- rag/db.py::is_available — zero local callers. Keep for now, flag
  for future cross-repo sweep.

Out of scope (tracked)
- Adopt alpha-engine-lib setup_logging in each ingestion script's
  main() for consistent log formatting + flow-doctor capture of
  per-pipeline errors. Currently only preflight.py uses the lib;
  ingestion scripts still use Python's default root logger. Minor
  follow-up.
- Date-parsing `except ValueError: continue` patterns in
  ingest_sec_filings, ingest_8k_filings, ingest_theses,
  ingest_earnings_transcripts. Reviewed case-by-case — all are
  legitimate "skip this malformed entry" flows with aggregate counts
  upstream. Not silent fails.

Test plan
- [x] pytest tests/ — 41 pass
- [x] Syntax check on all modified Python files
- [x] bash -n on run_weekly_ingestion.sh
- [ ] Next Saturday Step Function run exercises the hardened path.
  Forced failure test: unset FINNHUB_API_KEY on EC2 and re-run —
  must fail at preflight (step 0), not silently skip step 3.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant