feat(sf): research perf (skip_dry_run_gate) + named timeout alarm (L4464)#350

Merged
cipher813 merged 1 commit into main from feat/l4464-research-perf-skip-gate-timeout-alarm
May 30, 2026

@cipher813
Owner

Two safe, additive parts of the L4464 Research-timeout fix. The load-bearing fix (universe reduction so sector teams screen ~60, not ~903) is alpha-engine-research #256.

1. skip_dry_run_gate=true on the Saturday SF Research payload (perf)

The in-handler stub-LLM dry-run gate ran an extra full graph pass ahead of the real one, including a real ~4-min fetch_data (~8 min of the 900s budget, on the critical path every Saturday). Wiring is already validated by CI and the Friday shell-run preflight, so the scheduled production path now skips the gate; it stays available for manual/dev invokes. The test_sf_payload_uniqueness registry is updated and a value-pin test is added.
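For illustration only, a minimal sketch of how a handler could honor the flag. Only the `skip_dry_run_gate` key comes from this PR; the payload constant, `should_run_dry_run_gate`, `run_research`, and the callables passed to it are hypothetical names, not the actual alpha-engine-research code.

```python
from typing import Any, Callable, List, Mapping

# Hypothetical payload: only the skip_dry_run_gate key is taken from this PR.
SATURDAY_RESEARCH_PAYLOAD: Mapping[str, Any] = {"skip_dry_run_gate": True}


def should_run_dry_run_gate(event: Mapping[str, Any]) -> bool:
    """Run the gate unless the payload explicitly opts out with boolean true.

    A missing key, or any value other than True, keeps the gate on, so
    manual/dev invokes that omit the flag still get the dry-run pass.
    """
    return event.get("skip_dry_run_gate") is not True


def run_research(
    event: Mapping[str, Any],
    dry_run_pass: Callable[[Mapping[str, Any]], str],
    real_pass: Callable[[Mapping[str, Any]], str],
) -> List[str]:
    """Return the results of the passes that ran, in execution order."""
    results: List[str] = []
    if should_run_dry_run_gate(event):
        results.append(dry_run_pass(event))  # stub-LLM gate pass
    results.append(real_pass(event))  # the real graph pass always runs
    return results
```

Under these assumptions, a value-pin test would assert that the scheduled payload carries `skip_dry_run_gate` as exactly `True`, so a later edit cannot silently put the gate back on the critical path.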

2. setup_research_runner_timeout_alarm.sh (named timeout alarm)

A hard Lambda timeout runs no in-process code and does not increment the Errors metric, so the existing -errors alarm missed the 2026-05-30 timeout — the operator saw only a generic PipelineFailure. New alarm on Lambda Duration Maximum ≥ 870000 ms (30s below the 900s ceiling) → alpha-engine-alerts. Fires on a timeout AND on a near-miss overrun (early warning). Operator runs the one-shot script post-merge.
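As a rough sketch of the alarm definition described above, the parameters could be built like this. The metric, statistic, threshold, and 30s margin come from the PR text; the alarm name, `Period`, `EvaluationPeriods`, and `TreatMissingData` values are assumptions, and the real setup_research_runner_timeout_alarm.sh is a shell script whose exact settings may differ.

```python
from typing import Any, Dict


def build_timeout_alarm_params(
    function_name: str,
    topic_arn: str,
    timeout_s: int = 900,
    margin_s: int = 30,
) -> Dict[str, Any]:
    """Build CloudWatch PutMetricAlarm parameters for a near-timeout alarm."""
    # 900 s ceiling minus a 30 s margin, in milliseconds: 870000 ms.
    threshold_ms = (timeout_s - margin_s) * 1000
    return {
        "AlarmName": f"{function_name}-timeout",  # assumed naming scheme
        "Namespace": "AWS/Lambda",
        "MetricName": "Duration",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Maximum",
        "Period": 300,  # assumption: 5-minute evaluation window
        "EvaluationPeriods": 1,  # assumption: alarm on a single breach
        "Threshold": float(threshold_ms),
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",  # assumption: idle weeks stay OK
        "AlarmActions": [topic_arn],  # e.g. the alpha-engine-alerts topic ARN
    }
```

These keys match the keyword arguments of boto3's `put_metric_alarm`, so a Python caller could pass the dict as `cloudwatch.put_metric_alarm(**params)`. Deriving the threshold from the timeout and margin keeps the alarm aligned if the Lambda ceiling ever changes.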

Not included — Predictor‖Scanner topology move (deferred)

Moving PredictorTraining to fork parallel to Scanner after DataPhase1 requires relocating the 11-state Scanner / RAG / regime-substrate chain into the Research parallel branch and rewiring every error edge from the top-level HandleFailure to the branch-fail path. That's optimization-only (Predictor ⊥ Scanner/RAG/Research) and a delicate restructure of the already-fragile Saturday SF — it warrants its own fully-wiring-tested PR rather than bundling into the recovery push. Captured in the plan doc + filed as a follow-up.

Suite 1707 passing. Deploy: deploy_step_function.sh (SF) + run the alarm script. Held for review.

🤖 Generated with Claude Code

… timeout alarm (L4464)

Two safe, additive parts of the L4464 fix (the load-bearing universe-reduction
ships in alpha-engine-research #256):

1. skip_dry_run_gate=true on the Saturday SF Research Lambda payload. The
   in-handler stub-LLM dry-run gate ran a FULL second graph pass — including a
   real ~4-min fetch_data — before the real pass (~8 min of the 900s budget).
   Wiring is validated by CI + the Friday shell-run preflight; the gate stays
   available for manual/dev invokes. test_sf_payload_uniqueness registry
   updated + a value-pin test added.

2. setup_research_runner_timeout_alarm.sh — CloudWatch alarm on the
   research-runner Lambda Duration Maximum >= 870000 ms (30s below the 900s
   ceiling). A hard Lambda timeout runs no in-process code and does NOT hit
   the Errors metric, so the existing -errors alarm missed it (operator saw
   only a generic PipelineFailure). This names the timeout cause and gives an
   early-warning on near-miss overruns. Routes to alpha-engine-alerts.

NOT included: the Predictor-∥-Scanner topology move (an 11-state restructure
of the Scanner/RAG/regime-substrate chain into the Research parallel branch).
That's optimization-only and warrants its own fully-wiring-tested PR — filed
as a follow-up. Suite 1707 passing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@cipher813 cipher813 merged commit 2dae25e into main May 30, 2026
1 check passed
@cipher813 cipher813 deleted the feat/l4464-research-perf-skip-gate-timeout-alarm branch May 30, 2026 16:02