fix(substrate): tighten alarm Period 86400 → 3600 to close 24-37h state lag #190

Merged: cipher813 merged 1 commit into main from fix/substrate-alarm-period-tighten on May 8, 2026

@cipher813
Owner

Summary

  • CloudWatch alarms with Period=86400 + EvaluationPeriods=1 were evaluated only at 24h boundaries, so Saturday SF emissions at 12:00 UTC were invisible in alarm state until ~17:05 PT Sunday (~37h lag from cron start to the first alarm-state refresh).
  • Switch to Period=3600 + EvaluationPeriods=24 + DatapointsToAlarm=1: same 24h trailing window (24 × 3600 s = 86400 s), hourly refresh cadence, and the alarm reflects the most recent SF emission within ~1h.
  • Cost delta: 24× more CloudWatch eval reads/day; bounded.
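The window arithmetic and the "any 1 of 24" rule above can be sketched as a small model. This is a simplified illustration, not CloudWatch's actual evaluator: the threshold of 1 and the "missing data counts as not breaching" behavior are assumptions taken from the commit message's wording ("Min < 1", "notBreaching periods"), and the function names are hypothetical.

```python
from typing import Optional, Sequence

PERIOD_SECONDS = 3600        # new Period from this PR
EVALUATION_PERIODS = 24      # new EvaluationPeriods from this PR
DATAPOINTS_TO_ALARM = 1      # new DatapointsToAlarm from this PR
THRESHOLD = 1.0              # assumption: alarm condition is "Min < 1"


def window_seconds(period: int, evaluation_periods: int) -> int:
    """Length of the trailing evaluation window in seconds."""
    return period * evaluation_periods


def alarm_state(hourly_minimums: Sequence[Optional[float]]) -> str:
    """Simplified M-of-N evaluation over hourly datapoints, oldest first.

    None means no datapoint was emitted in that hour; it is treated as
    not breaching (assumed TreatMissingData=notBreaching).
    """
    window = hourly_minimums[-EVALUATION_PERIODS:]
    breaching = sum(1 for v in window if v is not None and v < THRESHOLD)
    return "ALARM" if breaching >= DATAPOINTS_TO_ALARM else "OK"
```

Under this model a weekly row that emits one failing datapoint keeps the alarm in ALARM until that datapoint ages out of the 24 hourly periods, which matches the old 24h window while refreshing every hour.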

Live operator step (after merge)

cd ~/Development/alpha-engine-data
./infrastructure/setup_substrate_alarms.sh

Script is idempotent — applies the new Period semantics to all 10 existing alarms (9 per-row + 1 aggregate).
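For readers who want to see the shape of the new settings, here is a minimal sketch of the alarm parameters as a Python dict using the CloudWatch PutMetricAlarm field names. It is not the contents of setup_substrate_alarms.sh: the helper name and the alarm, namespace, and metric names passed in are hypothetical, and the Minimum statistic, threshold, comparison operator, and missing-data treatment are assumptions inferred from the commit message.

```python
from typing import Any, Dict


def substrate_alarm_kwargs(alarm_name: str, namespace: str, metric_name: str) -> Dict[str, Any]:
    """Build PutMetricAlarm parameters with the Period semantics from this PR.

    alarm_name, namespace, and metric_name are placeholders supplied by the
    caller; the real values live in the setup script and are not shown here.
    """
    return {
        "AlarmName": alarm_name,
        "Namespace": namespace,
        "MetricName": metric_name,
        "Statistic": "Minimum",                     # assumption: "24h Min < 1"
        "Period": 3600,                             # was 86400
        "EvaluationPeriods": 24,                    # was 1
        "DatapointsToAlarm": 1,
        "Threshold": 1.0,                           # assumption
        "ComparisonOperator": "LessThanThreshold",  # assumption
        "TreatMissingData": "notBreaching",         # assumption
    }
```

A dict like this could be passed to boto3's `put_metric_alarm(**kwargs)`. PutMetricAlarm updates an existing alarm of the same name in place, which is consistent with re-running the script being safe.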

Test plan

  • pytest tests/test_substrate_alarms_script.py (17/17 passing, including 3 new regression tests pinning Period=3600, EvaluationPeriods=24, DatapointsToAlarm=1)
  • bash -n infrastructure/setup_substrate_alarms.sh clean
  • After merge: re-run setup script live; verify aws cloudwatch describe-alarms --alarm-name-prefix alpha-engine-substrate- shows Period: 3600 on all 10 alarms
  • Sat 5/9 SF post-cron: alarm state should refresh by ~03:00-04:00 PT Sat instead of ~17:05 PT Sun
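The post-merge check on describe-alarms output could be automated along these lines. This is a hedged sketch: the function name is hypothetical, and it assumes the JSON output of `aws cloudwatch describe-alarms` has already been parsed into a dict (metric alarms appear under the `MetricAlarms` key).

```python
from typing import Any, Dict, List

# Settings this PR expects on every substrate alarm.
EXPECTED = {"Period": 3600, "EvaluationPeriods": 24, "DatapointsToAlarm": 1}


def misconfigured_alarms(describe_alarms_output: Dict[str, Any]) -> List[str]:
    """Return names of alarms whose settings differ from EXPECTED."""
    bad = []
    for alarm in describe_alarms_output.get("MetricAlarms", []):
        # A missing key compares unequal to the expected value, so it is flagged.
        if any(alarm.get(key) != value for key, value in EXPECTED.items()):
            bad.append(alarm.get("AlarmName", "<unnamed>"))
    return bad
```

An empty list means every returned alarm carries the new settings; the count of alarms (10) would still need to be checked separately.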

Related

🤖 Generated with Claude Code

…te lag

CloudWatch alarms with Period=86400 + EvaluationPeriods=1 only evaluated
at 24h boundaries, so Sat SF emissions at 12:00 UTC Sat were not visible
in alarm state until ~17:05 PT Sun (~37h lag from cron start to first
alarm-state refresh). The substrate framework's "every miss is an
incident in <24h" promise was structurally untrue at the alarm layer.

Switch to Period=3600 + EvalPeriods=24 + DatapointsToAlarm=1:
- Same 24h trailing evaluation window (24 × 3600 = 86400)
- Hourly refresh cadence instead of daily
- DatapointsToAlarm=1 means any single failing hourly period in the
  trailing 24 fires the alarm (matches the prior "any 24h Min < 1"
  semantics; weekly rows still emit once per Sat with 23 trailing
  notBreaching periods, alarm fires iff that one real datapoint is < 1)
- Cost: 24× more CloudWatch eval reads/day; bounded.

Live operator step required after merge: re-run
`./infrastructure/setup_substrate_alarms.sh` once to apply the new
Period to all 10 existing alarms (script is idempotent).

Closes ROADMAP P0 line 2082.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cipher813 cipher813 merged commit e7763d4 into main May 8, 2026
1 check passed
@cipher813 cipher813 deleted the fix/substrate-alarm-period-tighten branch May 8, 2026 13:23