
feat(deploy): auto-rollback on post-switch error-rate breach#315

Merged
mikewheeleer merged 1 commit into Talenttrust:main from Jayking40:#264-Add-automatic-rollback-trigger-on-post-switch-error-rate-threshold-in-deploy.ts
Jun 2, 2026


@Jayking40

Contributor

feat(deploy): Automatic rollback on post-switch error-rate breach

Summary

After a blue → green switch, regressions were previously only caught manually
via deploy:status. This PR adds an automatic safety net: after promoting
green, the deployer observes the HTTP error rate for a configurable soak
window and automatically rolls back to the previous color if a threshold is
breached.

The error rate is derived from the http_requests_total Prometheus counter and
computed as the delta from a baseline snapshot captured at switch time, so
only traffic served after the switch is judged — not the process's lifetime
average. Every decision is emitted as a structured deploy_decision log so the
reasoning behind keeping or reverting a deployment is fully auditable.
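
The delta accounting described above can be sketched roughly as follows. This is an illustration, not the code from this PR: the `CounterSnapshot` shape, the `evaluateSoak` function, and the `healthy`/`breach` verdict names are hypothetical, and the snapshot is assumed to have already been reduced from `http_requests_total` to a total count and a 5xx count.

```typescript
// Hypothetical reduction of http_requests_total to two numbers.
interface CounterSnapshot {
  total: number;
  errors5xx: number;
}

type Verdict = "healthy" | "breach" | "insufficient-data";

interface SoakEvaluation {
  verdict: Verdict;
  requests: number;
  errors: number;
  errorRate: number;
}

// Judge only the traffic served since the baseline captured at switch time.
function evaluateSoak(
  baseline: CounterSnapshot,
  current: CounterSnapshot,
  threshold: number,
  minRequests: number,
): SoakEvaluation {
  // Counters normally only grow; clamp at zero in case a restart reset them.
  const requests = Math.max(0, current.total - baseline.total);
  const errors = Math.max(0, current.errors5xx - baseline.errors5xx);
  const errorRate = requests > 0 ? errors / requests : 0;

  let verdict: Verdict;
  if (requests < minRequests) {
    // Too little post-switch traffic to trust the ratio.
    verdict = "insufficient-data";
  } else if (errorRate > threshold) {
    verdict = "breach";
  } else {
    verdict = "healthy";
  }
  return { verdict, requests, errors, errorRate };
}
```

For instance, with a baseline of 10,000 requests and 900 errors and a current reading of 10,200 requests and 902 errors, the post-switch delta is 200 requests and 2 errors (1%). Against a 5% threshold that is not a breach, even though the lifetime ratio is close to 9%.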

Changes

  • Added a post-switch soak monitor that samples the error-rate metric at a fixed
    interval and triggers an automatic rollback when the 5xx fraction exceeds the
    configured threshold.
  • Made thresholds and timing fully environment-configurable and validated:
    master enable switch, error-rate threshold (0..1), soak window, sample
    interval, and a minimum-request floor. Invalid values fail fast with
    secret-free errors; window and interval are clamped to safe bounds.
  • Emitted structured deploy-decision logs for every verdict (soak start, each
    observation, breach, completed rollback, retained deployment, and skips).
  • Guarded the rollback path so it never reverts a deployment it didn't promote,
    is idempotent (a repeated or concurrent run cannot double-revert), and is
    always reflected by deploy:status.
  • Added noise resistance via a minimum-request threshold so a handful of early
    errors cannot revert a healthy deployment.
  • Wired the soak into the switch-green command and added a standalone
    auto-rollback command/script for gating CI/CD steps.
  • Documented the feature, configuration, and security notes; updated the
    blue-green guide and the environment example.
  • Removed a dead, broken redaction snippet in the logger (it called a
    non-existent Set.flatMap) that blocked importing the structured logger.
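
To make the validation and clamping behaviour concrete, here is a minimal sketch of how such a config loader might look. The environment variable names (`AUTO_ROLLBACK_*`), the defaults, and the clamp bounds are all assumptions made for illustration; the actual names and values are in the environment example shipped with this PR.

```typescript
type Env = Record<string, string | undefined>;

interface SoakConfig {
  enabled: boolean;
  errorRateThreshold: number; // fraction in 0..1
  soakWindowMs: number;
  sampleIntervalMs: number;
  minRequests: number;
}

const clamp = (n: number, lo: number, hi: number): number =>
  Math.min(hi, Math.max(lo, n));

function parseBool(env: Env, key: string, fallback: boolean): boolean {
  const raw = env[key]?.trim();
  if (raw === undefined || raw === "") return fallback;
  if (raw === "true") return true;
  if (raw === "false") return false;
  // Name the variable but never echo its value, so errors stay secret-free.
  throw new Error(`${key} must be "true" or "false"`);
}

function parseNumber(env: Env, key: string, fallback: number, integer: boolean): number {
  const raw = env[key]?.trim();
  if (raw === undefined || raw === "") return fallback;
  const n = Number(raw);
  if (!Number.isFinite(n) || (integer && !Number.isInteger(n))) {
    throw new Error(`${key} must be a finite ${integer ? "integer" : "number"}`);
  }
  return n;
}

// All variable names, defaults, and bounds below are hypothetical.
function loadSoakConfig(env: Env): SoakConfig {
  const threshold = parseNumber(env, "AUTO_ROLLBACK_ERROR_RATE_THRESHOLD", 0.05, false);
  if (threshold < 0 || threshold > 1) {
    throw new Error("AUTO_ROLLBACK_ERROR_RATE_THRESHOLD must be within 0..1");
  }
  const minRequests = parseNumber(env, "AUTO_ROLLBACK_MIN_REQUESTS", 20, true);
  if (minRequests < 0) {
    throw new Error("AUTO_ROLLBACK_MIN_REQUESTS must not be negative");
  }
  return {
    enabled: parseBool(env, "AUTO_ROLLBACK_ENABLED", true),
    errorRateThreshold: threshold,
    // Out-of-range timing values are clamped rather than rejected.
    soakWindowMs: clamp(parseNumber(env, "AUTO_ROLLBACK_SOAK_WINDOW_MS", 60000, true), 1000, 600000),
    sampleIntervalMs: clamp(parseNumber(env, "AUTO_ROLLBACK_SAMPLE_INTERVAL_MS", 5000, true), 250, 60000),
    minRequests,
  };
}
```

Passing the environment in as a parameter (for example `loadSoakConfig(process.env)`) keeps the parser easy to test. Rejecting malformed values while clamping timing values means a misconfigured window can never produce an unbounded or near-zero polling loop.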

Testing

  • npm test: deploy.test.ts passes (38 tests), stable across repeated runs.
  • New tests cover the required acceptance scenarios and edge cases:
    • Healthy soak → no rollback, deployment retained.
    • Breached soak → automatic rollback, reflected by getStatus.
    • Sub-threshold error rate → no rollback.
    • Insufficient request volume → no rollback (insufficient-data).
    • Delta-vs-lifetime accounting (pre-switch errors are ignored).
    • Disabled flag and not-green guard skip the soak.
    • Idempotency: re-running after rollback is a safe no-op.
    • Config parsing/validation: defaults, valid overrides, clamping, and
      rejection of invalid booleans/floats/integers/ranges.
    • Default metrics-backed reader derives 5xx counts from the registry.
  • 100% line coverage on the changed deployment module (requirement: 95%).
  • npm run lint passes clean on all changed files.

Security notes

  • No secrets are logged; decision logs carry only metrics and config values, and
    the shared logger redacts known sensitive keys.
  • All environment input is validated and bounded, preventing unbounded loops or
    registry hammering from a misconfigured environment.
  • The rollback transition is idempotent and persisted, so concurrent or repeated
    invocations cannot produce an inconsistent state.
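
One way to picture the idempotency and ownership guard is as a pure state transition. The sketch below is an assumption-laden illustration, not the PR's implementation: the `DeployState` shape, the `promotedSwitchId` marker, and `rollbackIfOwned` are hypothetical, and persistence is left out entirely. A real deployer would also need to write the resulting state atomically for the guard to hold under concurrent runs.

```typescript
type Color = "blue" | "green";

interface DeployState {
  active: Color;
  previous: Color | null;
  // Hypothetical marker: which switch the soak monitor is responsible for.
  promotedSwitchId: string | null;
}

interface RollbackResult {
  state: DeployState;
  rolledBack: boolean;
  reason: string;
}

// Revert only a deployment that this switch promoted; otherwise do nothing.
function rollbackIfOwned(state: DeployState, switchId: string): RollbackResult {
  if (state.promotedSwitchId !== switchId) {
    // Covers both "someone else promoted this" and "already rolled back".
    return { state, rolledBack: false, reason: "not-promoted-by-this-switch" };
  }
  if (state.previous === null) {
    return { state, rolledBack: false, reason: "no-previous-color" };
  }
  const next: DeployState = {
    active: state.previous,
    previous: null,
    // Clearing the marker is what makes a repeated call a no-op.
    promotedSwitchId: null,
  };
  return { state: next, rolledBack: true, reason: "error-rate-breach" };
}
```

Because the first successful rollback clears the marker, calling the function again with the same switch id returns the state unchanged, which mirrors the "re-running after rollback is a safe no-op" test scenario.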

Closes #264

@drips-wave

drips-wave Bot commented Jun 1, 2026


@Jayking40 Great news! 🎉 Based on an automated assessment of this PR, the linked Wave issue(s) no longer count against your application limits.

You can now apply to more issues while waiting for a review of this PR. Keep up the great work! 🚀


@mikewheeleer mikewheeleer merged commit 92122c1 into Talenttrust:main Jun 2, 2026
2 of 5 checks passed

Development

Successfully merging this pull request may close these issues.

Add automatic rollback trigger on post-switch error-rate threshold in deploy.ts
