Skip to content

fix(spot): relaunch on mid-run spot interruption in DataPhase1 orchestrator#349

Merged
cipher813 merged 1 commit into
mainfrom
fix/spot-interruption-retry-dataphase1
May 30, 2026
Merged

fix(spot): relaunch on mid-run spot interruption in DataPhase1 orchestrator#349
cipher813 merged 1 commit into
mainfrom
fix/spot-interruption-retry-dataphase1

Conversation

@cipher813
Copy link
Copy Markdown
Owner

What broke

The 2026-05-30 Saturday SF (alpha-engine-saturday-pipeline, exec bf00991d…) failed in DataPhase1 at 02:41 PT, ~41 min in. Root cause: the nested data spot (i-02e498e018441751f, c5.large/us-east-1a) was reclaimed by AWS mid-workload — spot request closed with instance-terminated-no-capacity ("no Spot capacity available that matches your request"). The workload SSM command returned ResponseCode -1 (lost instance, empty stdout/stderr), the orchestrator exited 1, and the whole weekly pipeline (Research, Predictor training, Backtester) never ran.

This is not a code bug — it's a transient AWS spot reclamation. The gap is that nothing relaunched after it.

Why the Friday preflight didn't catch it

The Friday "shell run" (shell_run: true) runs spot_data_weekly.sh --preflight-only: boot a spot → validate deps/config/creds → exit 0, no workload, ~12h before the real run. It structurally cannot catch this:

  1. Spot capacity is a real-time AWS event — Friday 1:30 PM capacity says nothing about Saturday 2:24 AM.
  2. The preflight never runs the workload, and the reclamation happened during the workload.
  3. Even Saturday's own entry preflight passed — the instance launched and bootstrapped fine, then was reclaimed.

The fix

alpha_engine_lib.ec2_spot already rotates instance_type × subnet on acquisition capacity errors; this adds the missing mid-run resilience. New on_exit EXIT trap in infrastructure/spot_data_weekly.sh classifies the failure before terminating the instance:

  • Confirmed spot interruption → relaunch a fresh spot (launcher rotates AZ/type) up to MAX_SPOT_ATTEMPTS, with backoff + a named AlphaEngine/SpotInterruptionRetry CloudWatch metric (observable, not silent). Signals: spot-request status code (instance-terminated-no-capacity / -by-price / -capacity-oversubscribed / marked-for-termination), instance StateReason (Server.SpotInstanceTermination / Server.InsufficientInstanceCapacity), or all-combinations-exhausted launch (ec2_spot rc 64).
  • Genuine workload error (instance still fulfilled/running) → not retried, fails loud per the fail-fast posture. Blind retry would mask a real collector/prune bug.

Trap installed before the launch so it also covers launch-time capacity exhaustion.

Why MAX_SPOT_ATTEMPTS=2 (one relaunch)

The binding constraint is the outer SSM executionTimeout the SF sets on the orchestrator invocation (DataPhase1/MorningEnrich = 5400s). A phase1 relaunch worst-cases ~65 min (fits 5400s); a second relaunch (~107 min) would blow that outer timeout. Raising MAX_SPOT_ATTEMPTS requires bumping the matching SF executionTimeout in lockstep — documented inline.

Verification

  • Trap logic exercised end-to-end with a stubbed-aws harness across all four paths: genuine error → no retry (exit 1); interruption → relaunch (args preserved, attempt incremented) → fail loud once exhausted; launch-exhaustion (rc 64) → relaunch; success → clean exit.
  • New static-grep test tests/test_spot_data_weekly_interruption_retry.py (19 cases) pins the behavior, mirroring test_spot_data_weekly_run_modes.py.
  • bash -n clean. Full suite: 1702 passed, 1 skipped.

Deploy / redrive note

Code-only change; the orchestrator runs on the ae-dashboard EC2 from /home/ec2-user/alpha-engine-data and is git pull-ed at SF runtime. After merge, the Saturday SF can be redriven to recover today's missed weekly pipeline (capacity has likely returned) — now on the resilient code path.

🤖 Generated with Claude Code

…trator

The 2026-05-30 Saturday SF failed in DataPhase1 when the nested data
spot (i-02e498e018441751f, c5.large/us-east-1a) was reclaimed by AWS
*mid-workload* with spot-request status `instance-terminated-no-capacity`.
The workload SSM command returned ResponseCode -1 (lost instance), the
orchestrator exited 1, and the entire weekly pipeline failed.

The lib launcher (alpha_engine_lib.ec2_spot) already rotates
instance_type × subnet on *acquisition* InsufficientInstanceCapacity,
but nothing relaunched after a *mid-run* reclamation — that was the gap.
The Friday shell-run preflight cannot catch this: it runs `--preflight-only`
(boot + validate + exit 0, no workload) ~12h earlier and cannot predict
Saturday-morning spot capacity; even Saturday's own entry preflight
passed (the instance launched fine, then was reclaimed during the job).

Adds an EXIT trap (`on_exit`) to infrastructure/spot_data_weekly.sh that
classifies the failure before terminating the instance:

- CONFIRMED spot interruption — spot-request status code
  (no-capacity / by-price / capacity-oversubscribed / marked-for-
  termination), instance StateReason (Server.SpotInstanceTermination /
  Server.InsufficientInstanceCapacity), or all-combinations-exhausted
  launch (ec2_spot rc 64) — self-re-execs a FRESH spot (the launcher
  rotates AZ/type) up to MAX_SPOT_ATTEMPTS, with a short backoff and a
  named CloudWatch metric (AlphaEngine/SpotInterruptionRetry) so the
  absorbed interruption is observable, never silent.
- GENUINE workload error (instance still fulfilled/running) is NOT
  retried and fails loud per the fail-fast posture — blind retry would
  mask a real collector/prune bug.

Default MAX_SPOT_ATTEMPTS=2 (one relaunch). The binding constraint is
the outer SSM executionTimeout the Saturday SF sets on the orchestrator
invocation (DataPhase1/MorningEnrich=5400s): a phase1 relaunch worst-
cases ~65 min, fitting 5400s; a second relaunch (~107 min) would blow
that timeout, so raising MAX_SPOT_ATTEMPTS requires bumping the matching
SF executionTimeout in lockstep (documented inline).

Trap installed before the launch so it also covers launch-time capacity
exhaustion. Verified end-to-end with a stubbed-aws harness across all
four paths (genuine→no-retry, interruption→retry-then-fail-loud,
launch-exhaustion→retry, success→clean exit). New static-grep test
pins the behavior (mirrors test_spot_data_weekly_run_modes.py).

Suite: 1702 passed, 1 skipped.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@cipher813 cipher813 merged commit 096827e into main May 30, 2026
1 check passed
@cipher813 cipher813 deleted the fix/spot-interruption-retry-dataphase1 branch May 30, 2026 12:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant