fix(spot): relaunch on mid-run spot interruption in DataPhase1 orchestrator by cipher813 · Pull Request #349 · cipher813/alpha-engine-data

cipher813 · 2026-05-30T12:54:09Z

What broke

The 2026-05-30 Saturday SF (alpha-engine-saturday-pipeline, exec bf00991d…) failed in DataPhase1 at 02:41 PT, ~41 min in. Root cause: the nested data spot (i-02e498e018441751f, c5.large/us-east-1a) was reclaimed by AWS mid-workload — spot request closed with instance-terminated-no-capacity ("no Spot capacity available that matches your request"). The workload SSM command returned ResponseCode -1 (lost instance, empty stdout/stderr), the orchestrator exited 1, and the whole weekly pipeline (Research, Predictor training, Backtester) never ran.

This is not a code bug — it's a transient AWS spot reclamation. The gap is that nothing relaunched after it.

Why the Friday preflight didn't catch it

The Friday "shell run" (shell_run: true) runs spot_data_weekly.sh --preflight-only: boot a spot → validate deps/config/creds → exit 0, no workload, ~12h before the real run. It structurally cannot catch this:

- Spot capacity is a real-time AWS event: Friday 1:30 PM capacity says nothing about Saturday 2:24 AM.
- The preflight never runs the workload, and the reclamation happened during the workload.
- Even Saturday's own entry preflight passed: the instance launched and bootstrapped fine, then was reclaimed.

The fix

alpha_engine_lib.ec2_spot already rotates instance_type × subnet on acquisition capacity errors; this PR adds the missing mid-run resilience. A new `on_exit` EXIT trap in infrastructure/spot_data_weekly.sh classifies the failure before terminating the instance:

- Confirmed spot interruption → relaunch a fresh spot (launcher rotates AZ/type) up to MAX_SPOT_ATTEMPTS, with backoff and a named AlphaEngine/SpotInterruptionRetry CloudWatch metric (observable, not silent). Signals: spot-request status code (instance-terminated-no-capacity / -by-price / -capacity-oversubscribed / marked-for-termination), instance StateReason (Server.SpotInstanceTermination / Server.InsufficientInstanceCapacity), or all-combinations-exhausted launch (ec2_spot rc 64).
- Genuine workload error (instance still fulfilled/running) → not retried, fails loud per the fail-fast posture. Blind retry would mask a real collector/prune bug.

Trap installed before the launch so it also covers launch-time capacity exhaustion.
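
The classification rule can be sketched in Python for readability. This is a minimal illustration, not the shell code in the trap: the `FailureSignals` type and both function names are hypothetical, and the `-by-price` / `-capacity-oversubscribed` shorthand is assumed to expand with the `instance-terminated` prefix.

```python
from dataclasses import dataclass
from typing import Optional

# Signals named in this PR. How the script actually matches them is not shown
# here, so these sets are an assumption based on the description.
INTERRUPTION_STATUS_CODES = {
    "instance-terminated-no-capacity",
    "instance-terminated-by-price",
    "instance-terminated-capacity-oversubscribed",
    "marked-for-termination",
}
INTERRUPTION_STATE_REASONS = {
    "Server.SpotInstanceTermination",
    "Server.InsufficientInstanceCapacity",
}
LAUNCH_EXHAUSTED_RC = 64  # ec2_spot: every instance_type x subnet combination failed


@dataclass
class FailureSignals:
    """Hypothetical container for what the trap can observe at exit."""
    spot_status_code: Optional[str] = None   # spot-request status code
    state_reason_code: Optional[str] = None  # instance StateReason code
    launch_rc: Optional[int] = None          # exit code of the launcher


def is_spot_interruption(signals: FailureSignals) -> bool:
    """True if any one of the three confirmed-interruption signals is present."""
    return (
        signals.spot_status_code in INTERRUPTION_STATUS_CODES
        or signals.state_reason_code in INTERRUPTION_STATE_REASONS
        or signals.launch_rc == LAUNCH_EXHAUSTED_RC
    )


def should_relaunch(signals: FailureSignals, attempt: int, max_attempts: int = 2) -> bool:
    """Relaunch only on a confirmed interruption with attempts remaining.

    `attempt` is 1 for the first run, so max_attempts=2 allows one relaunch.
    """
    return is_spot_interruption(signals) and attempt < max_attempts
```

With these definitions, a request that is still `fulfilled` is treated as a genuine workload error and is never retried, while `instance-terminated-no-capacity` on attempt 1 triggers the single relaunch.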

Why `MAX_SPOT_ATTEMPTS=2` (one relaunch)

The binding constraint is the outer SSM executionTimeout the SF sets on the orchestrator invocation (DataPhase1/MorningEnrich = 5400s). A phase1 relaunch worst-cases ~65 min (fits 5400s); a second relaunch (~107 min) would blow that outer timeout. Raising MAX_SPOT_ATTEMPTS requires bumping the matching SF executionTimeout in lockstep — documented inline.
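
The budget check behind that choice is simple arithmetic, sketched below. The ~65 and ~107 minute figures are the ones quoted above; treating them as total worst-case wall-clock time for MAX_SPOT_ATTEMPTS of 2 and 3 respectively is an assumption, and the function names are hypothetical.

```python
EXECUTION_TIMEOUT_S = 5400  # outer SSM executionTimeout for DataPhase1/MorningEnrich

# Worst-case minutes quoted in this PR, keyed by MAX_SPOT_ATTEMPTS
# (2 = one relaunch, 3 = a second relaunch). Assumed to be totals.
WORST_CASE_MIN = {2: 65, 3: 107}


def worst_case_seconds(max_attempts: int) -> int:
    """Convert the quoted worst-case minutes to seconds."""
    return WORST_CASE_MIN[max_attempts] * 60


def fits_timeout(max_attempts: int, timeout_s: int = EXECUTION_TIMEOUT_S) -> bool:
    """True if the worst case for this attempt count fits the outer timeout."""
    return worst_case_seconds(max_attempts) <= timeout_s


if __name__ == "__main__":
    for attempts in sorted(WORST_CASE_MIN):
        print(attempts, worst_case_seconds(attempts), fits_timeout(attempts))
    # 2 attempts: 3900 s <= 5400 s, fits
    # 3 attempts: 6420 s >  5400 s, does not fit
```

Under these figures, moving to three attempts would need an executionTimeout of at least 6420s, which is why the two values have to change together.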

Verification

- Trap logic exercised end-to-end with a stubbed-aws harness across all four paths: genuine error → no retry (exit 1); interruption → relaunch (args preserved, attempt incremented) → fail loud once exhausted; launch-exhaustion (rc 64) → relaunch; success → clean exit.
- New static-grep test tests/test_spot_data_weekly_interruption_retry.py (19 cases) pins the behavior, mirroring test_spot_data_weekly_run_modes.py.
- `bash -n` clean. Full suite: 1702 passed, 1 skipped.
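
For readers unfamiliar with the static-grep style, the idea is to assert on the script's text rather than execute it. The sketch below is illustrative only: `SAMPLE_SCRIPT` is a hypothetical stand-in (the real script and the 19 real cases are not reproduced here), and `launch_spot` is an invented marker for the launch step.

```python
import re

# Hypothetical stand-in for infrastructure/spot_data_weekly.sh. The real
# script's contents are not shown in this PR description.
SAMPLE_SCRIPT = """\
MAX_SPOT_ATTEMPTS="${MAX_SPOT_ATTEMPTS:-2}"
on_exit() {
  :
}
trap on_exit EXIT
launch_spot
"""

# Patterns a static-grep test might pin (assumed, not the real test's list).
REQUIRED_PATTERNS = [
    r'^MAX_SPOT_ATTEMPTS="\$\{MAX_SPOT_ATTEMPTS:-2\}"$',
    r"^trap on_exit EXIT$",
]


def missing_patterns(text: str, patterns: list) -> list:
    """Return the patterns that match no line of the script text."""
    return [p for p in patterns if not re.search(p, text, flags=re.MULTILINE)]


def trap_precedes_launch(text: str, launch_marker: str = "launch_spot") -> bool:
    """True if the EXIT trap is installed before the first launch marker."""
    trap_at = text.find("trap on_exit EXIT")
    launch_at = text.find(launch_marker)
    return 0 <= trap_at < launch_at
```

A pytest case would then read the real file and assert that `missing_patterns` returns an empty list and that `trap_precedes_launch` holds, so an accidental reordering or a changed default fails the suite without booting an instance.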

Deploy / redrive note

Code-only change; the orchestrator runs on the ae-dashboard EC2 from /home/ec2-user/alpha-engine-data and is git pull-ed at SF runtime. After merge, the Saturday SF can be redriven to recover today's missed weekly pipeline (capacity has likely returned) — now on the resilient code path.

🤖 Generated with Claude Code

…trator

The 2026-05-30 Saturday SF failed in DataPhase1 when the nested data spot (i-02e498e018441751f, c5.large/us-east-1a) was reclaimed by AWS *mid-workload* with spot-request status `instance-terminated-no-capacity`. The workload SSM command returned ResponseCode -1 (lost instance), the orchestrator exited 1, and the entire weekly pipeline failed.

The lib launcher (alpha_engine_lib.ec2_spot) already rotates instance_type × subnet on *acquisition* InsufficientInstanceCapacity, but nothing relaunched after a *mid-run* reclamation; that was the gap.

The Friday shell-run preflight cannot catch this: it runs `--preflight-only` (boot + validate + exit 0, no workload) ~12h earlier and cannot predict Saturday-morning spot capacity; even Saturday's own entry preflight passed (the instance launched fine, then was reclaimed during the job).

Adds an EXIT trap (`on_exit`) to infrastructure/spot_data_weekly.sh that classifies the failure before terminating the instance:

- CONFIRMED spot interruption, signalled by spot-request status code (no-capacity / by-price / capacity-oversubscribed / marked-for-termination), instance StateReason (Server.SpotInstanceTermination / Server.InsufficientInstanceCapacity), or all-combinations-exhausted launch (ec2_spot rc 64): self-re-execs a FRESH spot (the launcher rotates AZ/type) up to MAX_SPOT_ATTEMPTS, with a short backoff and a named CloudWatch metric (AlphaEngine/SpotInterruptionRetry) so the absorbed interruption is observable, never silent.
- GENUINE workload error (instance still fulfilled/running) is NOT retried and fails loud per the fail-fast posture; blind retry would mask a real collector/prune bug.

Default MAX_SPOT_ATTEMPTS=2 (one relaunch). The binding constraint is the outer SSM executionTimeout the Saturday SF sets on the orchestrator invocation (DataPhase1/MorningEnrich=5400s): a phase1 relaunch worst-cases ~65 min, fitting 5400s; a second relaunch (~107 min) would blow that timeout, so raising MAX_SPOT_ATTEMPTS requires bumping the matching SF executionTimeout in lockstep (documented inline).

Trap installed before the launch so it also covers launch-time capacity exhaustion.

Verified end-to-end with a stubbed-aws harness across all four paths (genuine→no-retry, interruption→retry-then-fail-loud, launch-exhaustion→retry, success→clean exit). New static-grep test pins the behavior (mirrors test_spot_data_weekly_run_modes.py). Suite: 1702 passed, 1 skipped.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

cipher813 merged commit 096827e into main May 30, 2026
1 check passed

cipher813 deleted the fix/spot-interruption-retry-dataphase1 branch May 30, 2026 12:57

cipher813 merged 1 commit into main from fix/spot-interruption-retry-dataphase1
