fix(spot): relaunch on mid-run spot interruption in DataPhase1 orchestrator #349
Merged
…trator

The 2026-05-30 Saturday SF failed in DataPhase1 when the nested data spot (i-02e498e018441751f, c5.large/us-east-1a) was reclaimed by AWS *mid-workload* with spot-request status `instance-terminated-no-capacity`. The workload SSM command returned ResponseCode -1 (lost instance), the orchestrator exited 1, and the entire weekly pipeline failed.

The lib launcher (alpha_engine_lib.ec2_spot) already rotates instance_type × subnet on *acquisition* InsufficientInstanceCapacity, but nothing relaunched after a *mid-run* reclamation; that was the gap.

The Friday shell-run preflight cannot catch this: it runs `--preflight-only` (boot + validate + exit 0, no workload) ~12h earlier and cannot predict Saturday-morning spot capacity; even Saturday's own entry preflight passed (the instance launched fine, then was reclaimed during the job).

Adds an EXIT trap (`on_exit`) to infrastructure/spot_data_weekly.sh that classifies the failure before terminating the instance:

- CONFIRMED spot interruption, signaled by the spot-request status code (no-capacity / by-price / capacity-oversubscribed / marked-for-termination), the instance StateReason (Server.SpotInstanceTermination / Server.InsufficientInstanceCapacity), or an all-combinations-exhausted launch (ec2_spot rc 64): self-re-execs a FRESH spot (the launcher rotates AZ/type) up to MAX_SPOT_ATTEMPTS, with a short backoff and a named CloudWatch metric (AlphaEngine/SpotInterruptionRetry) so the absorbed interruption is observable, never silent.
- GENUINE workload error (instance still fulfilled/running) is NOT retried and fails loud per the fail-fast posture; blind retry would mask a real collector/prune bug.

Default MAX_SPOT_ATTEMPTS=2 (one relaunch). The binding constraint is the outer SSM executionTimeout the Saturday SF sets on the orchestrator invocation (DataPhase1/MorningEnrich=5400s): a phase1 relaunch worst-cases ~65 min, fitting 5400s; a second relaunch (~107 min) would blow that timeout, so raising MAX_SPOT_ATTEMPTS requires bumping the matching SF executionTimeout in lockstep (documented inline).

Trap installed before the launch so it also covers launch-time capacity exhaustion.

Verified end-to-end with a stubbed-aws harness across all four paths (genuine→no-retry, interruption→retry-then-fail-loud, launch-exhaustion→retry, success→clean exit). New static-grep test pins the behavior (mirrors test_spot_data_weekly_run_modes.py). Suite: 1702 passed, 1 skipped.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
What broke
The 2026-05-30 Saturday SF (`alpha-engine-saturday-pipeline`, exec `bf00991d…`) failed in DataPhase1 at 02:41 PT, ~41 min in. Root cause: the nested data spot (i-02e498e018441751f, c5.large/us-east-1a) was reclaimed by AWS mid-workload; the spot request closed with `instance-terminated-no-capacity` ("no Spot capacity available that matches your request"). The workload SSM command returned `ResponseCode -1` (lost instance, empty stdout/stderr), the orchestrator exited 1, and the whole weekly pipeline (Research, Predictor training, Backtester) never ran.

This is not a code bug; it is a transient AWS spot reclamation. The gap is that nothing relaunched after it.
Why the Friday preflight didn't catch it
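The Friday "shell run" (`shell_run: true`) runs `spot_data_weekly.sh --preflight-only`: boot a spot → validate deps/config/creds → `exit 0`, no workload, ~12h before the real run. It structurally cannot catch this: it cannot predict Saturday-morning spot capacity, and even Saturday's own entry preflight passed (the instance launched fine, then was reclaimed during the job).
The fix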
`alpha_engine_lib.ec2_spot` already rotates instance_type × subnet on acquisition capacity errors; this adds the missing mid-run resilience. New `on_exit` EXIT trap in `infrastructure/spot_data_weekly.sh` classifies the failure before terminating the instance:

- Confirmed spot interruption: self-re-execs a fresh spot (the launcher rotates AZ/type) up to `MAX_SPOT_ATTEMPTS`, with backoff + a named `AlphaEngine/SpotInterruptionRetry` CloudWatch metric (observable, not silent). Signals: spot-request status code (`instance-terminated-no-capacity` / `-by-price` / `-capacity-oversubscribed` / `marked-for-termination`), instance `StateReason` (`Server.SpotInstanceTermination` / `Server.InsufficientInstanceCapacity`), or all-combinations-exhausted launch (`ec2_spot` rc 64).
- Genuine workload error (instance still fulfilled/running): not retried; fails loud per the fail-fast posture, since a blind retry would mask a real collector/prune bug.

Trap installed before the launch so it also covers launch-time capacity exhaustion.
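The classification and retry cap described above can be sketched roughly as follows. This is a hedged Python model for readability only: the real logic lives in the bash `on_exit` trap and is not reproduced here. The names `is_spot_interruption`, `decide`, and `fits_outer_timeout` are hypothetical. The status codes, StateReason codes, rc 64, and the 5400 s / ~65 min / ~107 min figures are taken from this PR's description.

```python
from typing import Optional

# Spot-request status codes this PR treats as a confirmed interruption.
INTERRUPTION_STATUS_CODES = frozenset({
    "instance-terminated-no-capacity",
    "instance-terminated-by-price",
    "instance-terminated-capacity-oversubscribed",
    "marked-for-termination",
})

# Instance StateReason codes this PR treats as a confirmed interruption.
INTERRUPTION_STATE_REASONS = frozenset({
    "Server.SpotInstanceTermination",
    "Server.InsufficientInstanceCapacity",
})

# Launcher exit code the PR associates with "all combinations exhausted".
LAUNCH_EXHAUSTED_RC = 64


def is_spot_interruption(status_code: Optional[str],
                         state_reason: Optional[str],
                         launch_rc: Optional[int]) -> bool:
    """True if any one of the three signals marks the failure as capacity-related."""
    return (
        status_code in INTERRUPTION_STATUS_CODES
        or state_reason in INTERRUPTION_STATE_REASONS
        or launch_rc == LAUNCH_EXHAUSTED_RC
    )


def decide(exit_code: int, attempt: int, max_attempts: int,
           status_code: Optional[str] = None,
           state_reason: Optional[str] = None,
           launch_rc: Optional[int] = None) -> str:
    """Return 'exit-clean', 'relaunch', or 'fail-loud' for a finished attempt.

    `attempt` is 1-based, so max_attempts=2 allows exactly one relaunch.
    """
    if exit_code == 0:
        return "exit-clean"
    if not is_spot_interruption(status_code, state_reason, launch_rc):
        # Genuine workload error: never retried.
        return "fail-loud"
    if attempt < max_attempts:
        return "relaunch"
    # Interruption confirmed, but the attempt budget is spent.
    return "fail-loud"


def fits_outer_timeout(worst_case_minutes: float,
                       execution_timeout_s: int = 5400) -> bool:
    """Check a worst-case cumulative runtime against the outer SSM timeout."""
    return worst_case_minutes * 60 <= execution_timeout_s
```

With these definitions, an interruption on attempt 1 of 2 yields `"relaunch"`, the same interruption on attempt 2 yields `"fail-loud"`, and `fits_outer_timeout(65)` holds while `fits_outer_timeout(107)` does not, which is the arithmetic behind keeping the default at one relaunch.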
Why `MAX_SPOT_ATTEMPTS=2` (one relaunch)
The binding constraint is the outer SSM `executionTimeout` the SF sets on the orchestrator invocation (DataPhase1/MorningEnrich = 5400s, i.e. 90 min). A phase1 relaunch worst-cases ~65 min (fits 5400s); a second relaunch (~107 min) would blow that outer timeout. Raising `MAX_SPOT_ATTEMPTS` requires bumping the matching SF `executionTimeout` in lockstep (documented inline).
Verification
- Verified end-to-end with a stubbed-`aws` harness across all four paths: genuine error → no retry (exit 1); interruption → relaunch (args preserved, attempt incremented) → fail loud once exhausted; launch-exhaustion (rc 64) → relaunch; success → clean exit.
- New static-grep test `tests/test_spot_data_weekly_interruption_retry.py` (19 cases) pins the behavior, mirroring `test_spot_data_weekly_run_modes.py`.
- `bash -n` clean. Full suite: 1702 passed, 1 skipped.
Deploy / redrive note
Code-only change; the orchestrator runs on the ae-dashboard EC2 from `/home/ec2-user/alpha-engine-data` and is `git pull`-ed at SF runtime. After merge, the Saturday SF can be redriven to recover today's missed weekly pipeline (capacity has likely returned), now on the resilient code path.

🤖 Generated with Claude Code