
fix(eod-sf): hardcode SNS ARN + Catch on HandleFailure (cost-guard)#240

Merged
cipher813 merged 1 commit into main from fix/eod-sf-cost-guard-hardening on May 14, 2026

@cipher813
Owner

Summary

  • 2026-05-14 EOD recovery surfaced a real design gap: HandleFailure had no Catch, so any SNS publish failure aborted the SF before reaching ForceStopInstance, leaving the trading EC2 running. Today's specific trigger was a malformed sns_topic_arn in the recovery payload (colon → space substitution between region and account ID), but the design has zero defense against any SNS-side failure (throttling, IAM drift, outage).
  • Fix 1: hardcode the SNS topic ARN in the SF definition. Removes the corruption surface entirely.
  • Fix 2: add a States.ALL Catch on HandleFailure → ForceStopInstance. Defense-in-depth so the cost-guard fires regardless of alert delivery.
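
The effect of the Catch can be illustrated with a toy interpreter. This is a minimal sketch, not the Step Functions runtime: the `run` helper, the handler functions, and the two-state `before`/`after` definitions are hypothetical, and it assumes HandleFailure transitions directly to ForceStopInstance and that ForceStopInstance is terminal.

```python
# Toy interpreter for a two-state slice of the workflow (illustration only).
def run(states, start, handlers):
    """Walk states from `start`, honoring Next/End and a States.ALL Catch."""
    visited = []
    current = start
    while current is not None:
        visited.append(current)
        state = states[current]
        try:
            handlers[current]()
            current = None if state.get("End") else state["Next"]
        except Exception:
            # Only catchers that match every error are modeled here.
            catchers = [
                c for c in state.get("Catch", [])
                if "States.ALL" in c["ErrorEquals"]
            ]
            if not catchers:
                visited.append("<execution aborted>")
                return visited
            current = catchers[0]["Next"]
    return visited


def failing_publish():
    # Stands in for any SNS-side failure (bad ARN, throttling, IAM drift).
    raise RuntimeError("SNS publish failed")


def stop_instance():
    # Stands in for the EC2 stop call.
    pass


handlers = {"HandleFailure": failing_publish, "ForceStopInstance": stop_instance}

# Assumed shape before the fix: no Catch on HandleFailure.
before = {
    "HandleFailure": {"Type": "Task", "Next": "ForceStopInstance"},
    "ForceStopInstance": {"Type": "Task", "End": True},
}
# Assumed shape after the fix: States.ALL routes to ForceStopInstance.
after = {
    "HandleFailure": {
        "Type": "Task",
        "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "ForceStopInstance"}],
        "Next": "ForceStopInstance",
    },
    "ForceStopInstance": {"Type": "Task", "End": True},
}

print(run(before, "HandleFailure", handlers))  # ['HandleFailure', '<execution aborted>']
print(run(after, "HandleFailure", handlers))   # ['HandleFailure', 'ForceStopInstance']
```

Without the Catch, the failed publish ends the walk before the stop state is visited; with it, the stop state runs even though the alert was never delivered.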

Live verification

Deployed via update_eod_pipeline_sf.sh. Running aws stepfunctions describe-state-machine against the live SF confirms:

  • HandleFailure.Parameters.TopicArn = literal arn:aws:sns:us-east-1:711398986525:alpha-engine-alerts (no .$ indirection)
  • HandleFailure.Catch = [{ErrorEquals: [States.ALL], Next: ForceStopInstance}]
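
The two properties above could also be checked programmatically against the `definition` string that describe-state-machine returns. The following is a rough sketch under assumptions: `check_handle_failure` is a hypothetical helper, HandleFailure is assumed to sit at the top level of `States`, and the `hardened` and `legacy` definitions are trimmed-down stand-ins rather than the real state machine.

```python
import json

EXPECTED_ARN = "arn:aws:sns:us-east-1:711398986525:alpha-engine-alerts"


def check_handle_failure(definition_json):
    """Return a list of problems in the HandleFailure state (empty means OK)."""
    state = json.loads(definition_json)["States"]["HandleFailure"]
    params = state.get("Parameters", {})
    problems = []
    # A key ending in ".$" means the value is resolved from input at runtime.
    if "TopicArn.$" in params:
        problems.append("TopicArn is still bound via JSONPath")
    if params.get("TopicArn") != EXPECTED_ARN:
        problems.append("TopicArn is not the expected literal")
    catch_ok = any(
        "States.ALL" in c.get("ErrorEquals", [])
        and c.get("Next") == "ForceStopInstance"
        for c in state.get("Catch", [])
    )
    if not catch_ok:
        problems.append("no States.ALL Catch routing to ForceStopInstance")
    return problems


# Hypothetical, trimmed-down definitions used only to exercise the check.
hardened = json.dumps({"States": {"HandleFailure": {
    "Parameters": {"TopicArn": EXPECTED_ARN},
    "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "ForceStopInstance"}],
}}})
legacy = json.dumps({"States": {"HandleFailure": {
    "Parameters": {"TopicArn.$": "$.sns_topic_arn"},
}}})

print(check_handle_failure(hardened))     # []
print(len(check_handle_failure(legacy)))  # 3
```

In practice the input string would come from something like `aws stepfunctions describe-state-machine --query definition --output text`, with the state machine ARN supplied by the caller.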

Tests

  • tests/test_sf_eod_substrate_check_wiring.py — 3 new pins under TestHandleFailureCostGuardHardening:
    • test_topic_arn_is_literal_not_jsonpath
    • test_handle_failure_has_catch_to_force_stop_instance
    • test_input_schema_no_longer_requires_sns_topic_arn
  • Full suite: 1035 passed, 1 skipped (was 1032, +3).
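
The test bodies are not shown in this PR. As one possible approach to pinning that nothing in the definition still reads `$.sns_topic_arn`, a recursive walk over the parsed definition would work. The sketch below is an assumption-laden illustration: `find_bindings` and the `old_definition`/`new_definition` dicts are hypothetical and are not taken from tests/test_sf_eod_substrate_check_wiring.py.

```python
def find_bindings(node, needle="$.sns_topic_arn", path=""):
    """Yield the location of every string value that references `needle`."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from find_bindings(value, needle, f"{path}/{key}")
    elif isinstance(node, list):
        for i, value in enumerate(node):
            yield from find_bindings(value, needle, f"{path}/{i}")
    elif isinstance(node, str) and needle in node:
        yield path


# Hypothetical stand-ins for the definition before and after the change.
old_definition = {"States": {"HandleFailure": {
    "Parameters": {"TopicArn.$": "$.sns_topic_arn", "Subject": "EOD failure"},
}}}
new_definition = {"States": {"HandleFailure": {
    "Parameters": {
        "TopicArn": "arn:aws:sns:us-east-1:711398986525:alpha-engine-alerts",
        "Subject": "EOD failure",
    },
    "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "ForceStopInstance"}],
}}}

print(list(find_bindings(old_definition)))  # ['/States/HandleFailure/Parameters/TopicArn.$']
print(list(find_bindings(new_definition)))  # []
```

Walking the whole tree rather than inspecting only HandleFailure means a binding reintroduced in any other state would also trip the pin.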

Test plan

  • pytest tests/test_sf_eod_substrate_check_wiring.py (20 passed)
  • pytest tests/ (1035 passed)
  • bash infrastructure/update_eod_pipeline_sf.sh — JSON valid, definition updated on live SF
  • aws stepfunctions describe-state-machine confirms hardcoded ARN + Catch on live SF
  • Next daemon-triggered EOD SF firing exercises the new shape (no input field needed; recovery payloads can drop sns_topic_arn)

Composes with

  • alpha-engine-data#238 (today's daily_append column-order hotfix)
  • alpha-engine#181 (today's eod_reconcile macro-lib dispatch)
  • ROADMAP entry under "New work added 2026-05-14" — to be closed in the next alpha-engine-config commit.

🤖 Generated with Claude Code

2026-05-14 EOD recovery surfaced a real design gap in the EOD SF.
When HandleFailure's SNS publish fails for any reason — malformed
$.sns_topic_arn (today's recovery payload had a colon → space
substitution between us-east-1 and the account ID), SNS throttling,
IAM drift, transient outage — the entire SF aborted before reaching
ForceStopInstance, leaving the trading EC2 running until manual stop.
The state's own comment ("Failure alert via SNS — instance still
stops to avoid cost") was unenforced.

Two-part defensive fix:

1. Hardcode the SNS topic ARN (no $.sns_topic_arn indirection). The
   ARN is fixed; per-execution variability buys nothing and creates
   a corruption surface. Today's recovery-input space-instead-of-colon
   would have been impossible.

2. Catch States.ALL on HandleFailure → ForceStopInstance so the
   cost-guard fires regardless of alert delivery (defense-in-depth
   even with #1 in place — covers SNS outages, IAM drift, future
   failure modes).

Live verification: deployed via update_eod_pipeline_sf.sh; describe-
state-machine confirms `TopicArn` is literal + Catch routes to
ForceStopInstance.

Tests: full alpha-engine-data suite 1035 passed (was 1032; +3 wiring
pins in test_sf_eod_substrate_check_wiring.py — TopicArn-is-literal,
HandleFailure-has-States.ALL-catch-to-ForceStopInstance, no-state-
binds-$.sns_topic_arn).

Composes with PR #238 (today's daily_append column-order hotfix).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cipher813 cipher813 merged commit 9e799d1 into main May 14, 2026
1 check passed
@cipher813 cipher813 deleted the fix/eod-sf-cost-guard-hardening branch May 14, 2026 21:56