feat(sf): research perf (skip_dry_run_gate) + named timeout alarm (L4464) #350
Merged
… timeout alarm (L4464)

Two safe, additive parts of the L4464 fix (the load-bearing universe-reduction ships in alpha-engine-research #256):

1. skip_dry_run_gate=true on the Saturday SF Research Lambda payload. The in-handler stub-LLM dry-run gate ran a FULL second graph pass, including a real ~4-min fetch_data, before the real pass (~8 min of the 900s budget). Wiring is validated by CI + the Friday shell-run preflight; the gate stays available for manual/dev invokes. test_sf_payload_uniqueness registry updated + a value-pin test added.

2. setup_research_runner_timeout_alarm.sh: CloudWatch alarm on the research-runner Lambda Duration Maximum >= 870000 ms (30s below the 900s ceiling). A hard Lambda timeout runs no in-process code and does NOT hit the Errors metric, so the existing -errors alarm missed it (operator saw only a generic PipelineFailure). This names the timeout cause and gives an early warning on near-miss overruns. Routes to alpha-engine-alerts.

NOT included: the Predictor-∥-Scanner topology move (an 11-state restructure of the Scanner/RAG/regime-substrate chain into the Research parallel branch). That's optimization-only and warrants its own fully-wiring-tested PR; filed as a follow-up.

Suite 1707 passing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
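For reviewers, the value-pin test mentioned in part 1 could look roughly like the sketch below. It is not the test shipped in this PR: the state name `Research`, the inline definition excerpt, and the helper `research_payload` are all assumptions made for illustration; only the `skip_dry_run_gate` key and its `true` value come from the description.

```python
import json

# Hypothetical excerpt of a Step Functions (ASL) definition. The state name
# "Research" and the function name are assumptions, not the real SF layout.
SF_DEFINITION = json.loads("""
{
  "States": {
    "Research": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "research-runner",
        "Payload": {"skip_dry_run_gate": true}
      }
    }
  }
}
""")


def research_payload(definition, state_name="Research"):
    """Return the Lambda payload dict of the named Task state."""
    return definition["States"][state_name]["Parameters"]["Payload"]


def test_research_payload_pins_skip_dry_run_gate():
    payload = research_payload(SF_DEFINITION)
    # Pin the exact value: `is True` rejects truthy stand-ins such as
    # the string "true" or the integer 1, and fails if the key is missing.
    assert payload.get("skip_dry_run_gate") is True


if __name__ == "__main__":
    test_research_payload_pins_skip_dry_run_gate()
    print("skip_dry_run_gate is pinned to true")
```

Pinning with `is True` rather than a truthiness check means a later edit that drops the key, or changes it to a string, fails the test instead of silently re-enabling the second graph pass.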
Two safe, additive parts of the L4464 Research-timeout fix. The load-bearing fix (universe reduction so sector teams screen ~60, not ~903) is alpha-engine-research #256.

1. skip_dry_run_gate=true on the Saturday SF Research payload (perf)

The in-handler stub-LLM dry-run gate ran a full second graph pass, including a real ~4-min fetch_data, before the real pass (~8 min of the 900s budget, on the critical path every Saturday). Wiring is validated by CI + the Friday shell-run preflight, so the scheduled production path skips it; the gate stays available for manual/dev invokes. test_sf_payload_uniqueness registry updated + a value-pin test added.

2. setup_research_runner_timeout_alarm.sh (named timeout alarm)

A hard Lambda timeout runs no in-process code and does not increment the Errors metric, so the existing -errors alarm missed the 2026-05-30 timeout; the operator saw only a generic PipelineFailure. New alarm on Lambda Duration Maximum ≥ 870000 ms (30s below the 900s ceiling) → alpha-engine-alerts. Fires on a timeout AND on a near-miss overrun (early warning). Operator runs the one-shot script post-merge.

Not included: Predictor‖Scanner topology move (deferred)

Moving PredictorTraining to fork parallel to Scanner after DataPhase1 requires relocating the 11-state Scanner / RAG / regime-substrate chain into the Research parallel branch and rewiring every error edge from the top-level HandleFailure to the branch-fail path. That's optimization-only (Predictor ⊥ Scanner/RAG/Research) and a delicate restructure of the already-fragile Saturday SF; it warrants its own fully-wiring-tested PR rather than bundling into the recovery push. Captured in the plan doc + filed as a follow-up.

Suite 1707 passing. Deploy: deploy_step_function.sh (SF) + run the alarm script. Held for review.

🤖 Generated with Claude Code
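As a rough illustration of the alarm in part 2, the sketch below builds the arguments one might pass to CloudWatch's `PutMetricAlarm` (for example via boto3's `put_metric_alarm`). It is not the contents of setup_research_runner_timeout_alarm.sh. The function name `research-runner`, the alarm name suffix, the SNS topic ARN, and the `Period`, `EvaluationPeriods`, and `TreatMissingData` values are assumptions; only the metric (Duration), statistic (Maximum), and the 870000 ms threshold come from the description.

```python
from typing import Any, Dict

LAMBDA_TIMEOUT_S = 900   # Lambda ceiling stated in the PR
SAFETY_MARGIN_S = 30     # alarm trips 30s below the ceiling


def timeout_alarm_params(function_name: str, sns_topic_arn: str) -> Dict[str, Any]:
    """Build PutMetricAlarm arguments for a Lambda near-timeout alarm."""
    threshold_ms = (LAMBDA_TIMEOUT_S - SAFETY_MARGIN_S) * 1000  # 870000
    return {
        "AlarmName": f"{function_name}-timeout",  # hypothetical name
        "AlarmDescription": (
            f"{function_name} Duration Maximum >= {threshold_ms} ms "
            "(timeout or near-miss overrun)"
        ),
        "Namespace": "AWS/Lambda",
        "MetricName": "Duration",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Maximum",
        "Period": 300,                       # assumption: 5-minute window
        "EvaluationPeriods": 1,              # assumption: one breach fires
        "Threshold": float(threshold_ms),
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",  # assumption: idle weeks stay OK
        "AlarmActions": [sns_topic_arn],
    }


if __name__ == "__main__":
    params = timeout_alarm_params(
        "research-runner",  # assumed function name
        "arn:aws:sns:us-east-1:123456789012:alpha-engine-alerts",  # placeholder ARN
    )
    print(params["Threshold"])
    # To create the alarm (requires boto3 and AWS credentials):
    #   import boto3
    #   boto3.client("cloudwatch").put_metric_alarm(**params)
```

Deriving the threshold from the two constants keeps the 30s margin explicit, so a future change to the Lambda timeout moves the alarm with it.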