
feat(deploy): Telegram + SNS alert on canary rollback (L221 — 3/5) #285

Merged
cipher813 merged 1 commit into `main` from `feat/canary-rollback-alerts-l221`
May 21, 2026

@cipher813
Owner

Summary

PR 3 of 5 in the L221 fleet pass. This is the repo that originated the recurrence class: `infrastructure/deploy.sh` is the script whose silent canary rollback fired 10 consecutive times across 2 days in the #274 retrospective. This PR adds the surveillance line that would have caught it at hit #1.

Implementation

Single insert in the canary-failure rollback block (L185-200): `python3 -m alpha_engine_lib.alerts publish --severity error --message "Canary rolled back: ..."` runs before `exit 1`. The trailing `|| true` ensures that a failed alert publish never overrides the deploy's intended exit code.

The message includes the function name, canary status, and the version transition (v→v-1) so the operator can match the alert to the rollback log.
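
As a rough illustration of the pattern described above (not the actual contents of `deploy.sh`), the rollback path could look something like the sketch below. The variable names `FUNCTION_NAME`, `CANARY_STATUS`, `NEW_VERSION`, and `PREV_VERSION`, the exact message layout, and the `ALERT_CMD` override hook are all assumptions made for the example; only the `python3 -m alpha_engine_lib.alerts publish --severity error --message ...` invocation and the trailing `|| true` come from this PR.

```shell
#!/usr/bin/env bash
# Sketch of a canary-failure rollback path with a best-effort alert.
# FUNCTION_NAME, CANARY_STATUS, NEW_VERSION and PREV_VERSION are hypothetical
# names; the real deploy.sh may use different ones.

publish_rollback_alert() {
  # Message layout is an assumption: function name, canary status,
  # and the version transition so the alert can be matched to the log.
  local message="Canary rolled back: ${FUNCTION_NAME} status=${CANARY_STATUS} v${NEW_VERSION} -> v${PREV_VERSION}"

  # ALERT_CMD is a hypothetical override so the path can be exercised
  # without alpha_engine_lib installed; by default the lib CLI is used.
  ${ALERT_CMD:-python3 -m alpha_engine_lib.alerts} publish \
    --severity error \
    --message "$message" || true   # a failed publish must not change the exit code
}

rollback_and_exit() {
  # (the actual rollback to the previous version would happen before this)
  publish_rollback_alert
  exit 1   # the deploy still fails, whether or not the alert went out
}
```

Under these assumptions, running `ALERT_CMD=false` and then calling `rollback_and_exit` in a subshell still yields exit status 1, which is the property the `|| true` is meant to guarantee.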

Sub-Lambda audit (negative)

The 4 sub-Lambda deploys don't have canary/rollback paths and stay unchanged:

  • `infrastructure/lambdas/spot-orphan-reaper/deploy.sh`
  • `infrastructure/lambdas/changelog-cloudwatch-mirror/deploy.sh`
  • `infrastructure/lambdas/eod-success-friday-shell-trigger/deploy.sh`
  • `infrastructure/lambdas/sf-telegram-notifier/deploy.sh`

The 5th sub-Lambda `changelog-incident-mirror/deploy.sh` already uses `alpha_engine_lib.alerts` (L143/L146 prior fleet pass, alpha-engine-data #277).
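
The PR does not say how the audit was carried out. One way a similar check could be reproduced is sketched below: a hypothetical helper that lists every `deploy.sh` under a directory that mentions a rollback but never references `alpha_engine_lib.alerts`. The function name and both grep patterns are assumptions, and a text match like this only approximates a manual read of each script.

```shell
#!/usr/bin/env bash
# Hypothetical audit helper (not part of this PR): flag deploy scripts that
# mention a rollback but contain no call to alpha_engine_lib.alerts.
audit_deploy_scripts() {
  local root="$1" script
  find "$root" -name deploy.sh -type f | sort | while IFS= read -r script; do
    # Assumed heuristics: "rollback" marks a rollback path,
    # the literal module name marks an alert call.
    if grep -qi 'rollback' "$script" && ! grep -qF 'alpha_engine_lib.alerts' "$script"; then
      echo "UNALERTED: $script"
    fi
  done
}
```

Calling `audit_deploy_scripts infrastructure` would print one `UNALERTED:` line per script that matches the heuristic, and nothing when every rollback path already has an alert call.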

Fleet pass scope

| repo | sites | status |
| --- | --- | --- |
| alpha-engine-research | 1 | #216 |
| alpha-engine-predictor | 3 | #184 |
| alpha-engine-data | 1 (main) | this PR |
| alpha-engine-backtester | 3 (health / counterfactual / concordance) | TBD |
| alpha-engine-dashboard | n/a | n/a |

Test plan

  • `bash -n infrastructure/deploy.sh` clean
  • Pre-commit hooks pass
  • First production exercise on the next canary rollback (which by definition is the failure mode this surveillance is designed for)

🤖 Generated with Claude Code

Independent-channel surveillance on the canary-rollback path that
fired silently 10 consecutive times across 2 days in the #274
retrospective. Best-effort lib alerts.publish before exit 1; trailing
|| true never overrides the deploy's exit code.

The 4 sub-Lambda deploys (spot-orphan-reaper / changelog-cloudwatch-mirror
/ eod-success-friday-shell-trigger / sf-telegram-notifier) don't have
canary/rollback paths — bootstrap-style deploys without a gate — so
no edit needed there. The changelog-incident-mirror already uses lib
alerts (per the L143/L146 fleet pass).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cipher813 cipher813 merged commit 09b5705 into main May 21, 2026
1 check passed
@cipher813 cipher813 deleted the feat/canary-rollback-alerts-l221 branch May 21, 2026 23:43