Skip to content

feat(visibility): surface per-collector errors to Flow Doctor#191

Merged
cipher813 merged 1 commit into
mainfrom
feat/flow-doctor-collector-visibility
May 9, 2026
Merged

feat(visibility): surface per-collector errors to Flow Doctor#191
cipher813 merged 1 commit into
mainfrom
feat/flow-doctor-collector-visibility

Conversation

@cipher813
Copy link
Copy Markdown
Owner

Summary

Why

The 2026-05-09 Saturday SF DataPhase1 PARTIAL run produced no per-failure Flow Doctor alert despite arcticdb failing on a backfill regression preflight. Tracing showed the only logger.error() that fires on partial status is main()'s end-of-run generic summary:

"Weekly collection finished with non-ok status=partial — Per-collector statuses: {arcticdb: error, ...}"

Two problems:

  1. Dedup collision. Same string for every partial run regardless of which collector failed → Flow Doctor's per-message dedup treats all partial runs as one alert. Cooldown window suppresses different-cause partial runs.
  2. No diagnose context. The actual error ("Backfill regression preflight failed: 38 symbols would regress...") lives only in the result dict. Flow Doctor's LLM diagnosis + GitHub-issue body has nothing specific to work with.

Fix placement

Call inside _finalize immediately after the status block:

  • Fires for every code path that finalizes (phase 1, phase 2, daily, morning enrich) without per-call wiring.
  • Runs before manifest write + postflight, so per-collector errors are visible to Flow Doctor even if postflight raises.
  • One line of orchestrator wiring, helper does the walking.

Dependency

⚠️ Blocked on cipher813/alpha-engine-lib#34 — that PR adds the collector_results module and tags v0.6.2. Merge order:

  1. Lib PR Fetch macro tickers daily via polygon → FRED → yfinance #34 → tag v0.6.2
  2. This PR's CI install of @v0.6.2 resolves
  3. Merge here

Test plan

  • New wiring test (2 cases): per-collector errors emit ERROR records; all-ok runs emit none. Both pass with lib installed locally.
  • Full unit suite: 552 passed, 1 skipped.
  • Sat 2026-05-16 SF cron will be the first production exercise. If a collector errors, expect a per-collector Flow Doctor alert in addition to the existing PARTIAL email.

Follow-up

  • Opportunistic adoption in alpha-engine-research / alpha-engine-predictor where the same status-dict pattern exists in stage outputs / per-team collectors.

🤖 Generated with Claude Code

Saturday SF DataPhase1 PARTIAL run on 2026-05-09 fired no per-failure
Flow Doctor alert. The arcticdb backfill regression was stored in the
result dict but never logged at ERROR level — only main()'s generic
"Weekly collection finished with non-ok status=partial" summary fires,
which produces a single dedup signature across every partial run and
contains no actual error text for Flow Doctor's LLM diagnose pipeline.

Fix: _finalize() now calls
alpha_engine_lib.collector_results.report_collector_errors(), which
emits one logger.error() per error-status entry with the collector name
+ original message. Each emitted record carries a distinct dedup
signature, restoring per-failure alert granularity.

- Pin alpha-engine-lib v0.5.1 → v0.6.2 (helper landed in lib PR #34)
- Wire call inside _finalize after status computation, before manifest
  write — fires for every code path that finalizes (phase 1, phase 2,
  daily, morning enrich) and runs even if postflight raises afterward
- New wiring test test_collector_error_visibility.py pins the call so
  a future refactor can't silently drop it

552 unit tests pass locally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cipher813 cipher813 merged commit 7351d39 into main May 9, 2026
1 of 2 checks passed
@cipher813 cipher813 deleted the feat/flow-doctor-collector-visibility branch May 9, 2026 12:56
cipher813 added a commit that referenced this pull request May 24, 2026
…ctions (#298)

AST-walk regression pin: every production function whose name matches
fresh|stale|preflight|postflight must not contain ``.days`` calendar
arithmetic. Closes the cross-repo defect class surfaced by the
2026-05-24 Sunday SF recovery: calendar-day gates trip on every
post-Saturday redrive even when data carries the most recent NYSE close.

Escape hatch: inline ``# noqa: trading-day`` marker on the same line
documents calendar-day correctness at that specific call site.

Explicit allowlist: ``collectors/prices._find_stale_fast`` checks S3
LastModified timestamp (write-recency, not data-freshness) and is
correctly calendar-day; allowlist verified by a second test that
ensures the named function actually exists in the named file.

Pin passes on current clean state (all 5 freshness sites migrated to
``alpha_engine_lib.dates.{trading_days_stale, is_fresh_in_trading_days}``
in PR #297). Would catch any future PR that adds calendar-day
arithmetic to a freshness-named function.

Composes with the lib v0.27.0 chokepoint (lib #59) + the cross-repo
migration arc (predictor #191, research #222).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant