feat(deploy): auto-rollback on post-switch error-rate breach #315
Merged
@Jayking40 Great news! 🎉 Based on an automated assessment of this PR, the linked Wave issue(s) no longer count against your application limits. You can now already apply to more issues while waiting for a review of this PR. Keep up the great work! 🚀
feat(deploy): Automatic rollback on post-switch error-rate breach
Summary
After a blue → green switch, regressions were previously only caught manually
via deploy:status. This PR adds an automatic safety net: after promoting
green, the deployer observes the HTTP error rate for a configurable soak
window and automatically rolls back to the previous color if a threshold is
breached.
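A minimal sketch of the soak loop described above may help reviewers picture the control flow. It is not the PR's implementation: the names soak, SoakOptions, sampleErrorRate and sleep are hypothetical, and the sampler is assumed to return null when there is not yet enough traffic to judge.

```typescript
// Hypothetical shapes; the PR's real types and function names are not shown here.
type SoakOutcome = "rollback" | "keep";

interface SoakOptions {
  windowMs: number; // total soak window after the switch
  intervalMs: number; // delay between samples
  threshold: number; // maximum tolerated error rate, in 0..1
}

async function soak(
  opts: SoakOptions,
  // Assumed sampler: resolves to the post-switch error rate, or null when
  // there is not enough data to judge yet.
  sampleErrorRate: () => Promise<number | null>,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<SoakOutcome> {
  const step = Math.max(1, opts.intervalMs); // guard against a zero interval
  let elapsed = 0;
  while (elapsed < opts.windowMs) {
    await sleep(step);
    elapsed += step;
    const rate = await sampleErrorRate();
    if (rate !== null && rate > opts.threshold) {
      return "rollback"; // breach: stop observing and revert
    }
  }
  return "keep"; // window elapsed without a breach
}
```

Injecting sleep keeps the loop testable without real timers; a caller would then revert to the previous color only when the result is "rollback".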
The error rate is derived from the http_requests_total Prometheus counter and
computed as the delta from a baseline snapshot captured at switch time, so
only traffic served after the switch is judged, not the process's lifetime
average. Every decision is emitted as a structured deploy_decision log so the
reasoning behind keeping or reverting a deployment is fully auditable.
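The baseline-delta calculation can be illustrated with a short sketch. It assumes the counter has already been scraped and summed into a hypothetical CounterSnapshot; the function name errorRateSince, the log fields, and the handling of counter resets are assumptions for illustration, not the PR's actual code.

```typescript
// Hypothetical snapshot of the http_requests_total counter, already summed
// by status class. How the PR scrapes and aggregates it is not shown.
interface CounterSnapshot {
  total: number; // all requests counted so far
  serverErrors: number; // requests that ended in a 5xx status
}

type ErrorRateResult =
  | { kind: "insufficient-data"; requests: number }
  | { kind: "rate"; requests: number; errorRate: number };

function errorRateSince(
  baseline: CounterSnapshot,
  current: CounterSnapshot,
  minRequests: number,
): ErrorRateResult {
  const requests = current.total - baseline.total;
  const errors = current.serverErrors - baseline.serverErrors;
  // Assumption: a negative delta means the counter was reset (for example by
  // a process restart), so the sample is treated as unusable.
  if (requests < 0 || errors < 0) {
    return { kind: "insufficient-data", requests: 0 };
  }
  // Below the minimum-request floor a single failure would dominate the rate.
  if (requests === 0 || requests < minRequests) {
    return { kind: "insufficient-data", requests };
  }
  return { kind: "rate", requests, errorRate: errors / requests };
}

const baseline: CounterSnapshot = { total: 1000, serverErrors: 10 };
const current: CounterSnapshot = { total: 1200, serverErrors: 30 };
const result = errorRateSince(baseline, current, 50);
// Post-switch traffic: 20 errors over 200 requests, a rate of 0.1.
// The lifetime average (30 / 1200 = 0.025) would have hidden the regression.
console.log(JSON.stringify({ event: "deploy_decision", result }));
```

Judging only the delta is what lets a long-running, previously healthy process still trip the threshold shortly after a bad switch.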
Changes
[…] interval and triggers an automatic rollback when the 5xx fraction exceeds the configured threshold.
[…] master enable switch, error-rate threshold (0..1), soak window, sample interval, and a minimum-request floor. Invalid values fail fast with secret-free errors; window and interval are clamped to safe bounds.
[…] observation, breach, completed rollback, retained deployment, and skips).
[…] is idempotent (a repeated or concurrent run cannot double-revert), and is always reflected by deploy:status.
[…] errors cannot revert a healthy deployment.
[…] switch-green command and added a standalone auto-rollback command/script for gating CI/CD steps.
[…] blue-green guide and the environment example.
[…] non-existent Set.flatMap) that blocked importing the structured logger.
Testing
npm test: deploy.test.ts passes (38 tests), stable across repeated runs.
[…] getStatus.
[…] insufficient-data).
[…] rejection of invalid booleans/floats/integers/ranges.
npm run lint passes clean on all changed files.
Security notes
[…] the shared logger redacts known sensitive keys.
[…] registry hammering from a misconfigured environment.
[…] invocations cannot produce an inconsistent state.
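The fail-fast and clamping behavior mentioned in the notes above can be sketched as follows. Everything named here is hypothetical: the environment variable names, the defaults, and the clamp bounds are placeholders, and "secret-free" is interpreted as error messages that name the offending variable without echoing its value.

```typescript
// Hypothetical config shape; the PR's actual variable names, defaults,
// and bounds are not shown in this description.
interface AutoRollbackConfig {
  enabled: boolean;
  errorRateThreshold: number;
  windowMs: number;
  intervalMs: number;
  minRequests: number;
}

type Env = Record<string, string | undefined>;

const clamp = (v: number, lo: number, hi: number): number =>
  Math.min(hi, Math.max(lo, v));

function parseBool(env: Env, key: string, fallback: boolean): boolean {
  const raw = env[key];
  if (raw === undefined || raw.trim() === "") return fallback;
  if (raw === "true") return true;
  if (raw === "false") return false;
  // The raw value is deliberately not echoed into the error message.
  throw new Error(key + ' must be "true" or "false"');
}

function parseNumber(env: Env, key: string, fallback: number, integer: boolean): number {
  const raw = env[key];
  if (raw === undefined || raw.trim() === "") return fallback;
  const n = Number(raw);
  if (!isFinite(n) || (integer && Math.floor(n) !== n)) {
    throw new Error(key + " must be " + (integer ? "an integer" : "a number"));
  }
  return n;
}

function loadAutoRollbackConfig(env: Env): AutoRollbackConfig {
  const threshold = parseNumber(env, "AUTO_ROLLBACK_ERROR_RATE", 0.05, false);
  if (threshold < 0 || threshold > 1) {
    throw new Error("AUTO_ROLLBACK_ERROR_RATE must be within 0..1");
  }
  const minRequests = parseNumber(env, "AUTO_ROLLBACK_MIN_REQUESTS", 20, true);
  if (minRequests < 0) {
    throw new Error("AUTO_ROLLBACK_MIN_REQUESTS must not be negative");
  }
  return {
    enabled: parseBool(env, "AUTO_ROLLBACK_ENABLED", false),
    errorRateThreshold: threshold,
    // Clamping (rather than rejecting) keeps a misconfigured environment from
    // polling too often or soaking for an unbounded time.
    windowMs: clamp(parseNumber(env, "AUTO_ROLLBACK_WINDOW_MS", 60000, true), 5000, 900000),
    intervalMs: clamp(parseNumber(env, "AUTO_ROLLBACK_INTERVAL_MS", 5000, true), 1000, 60000),
    minRequests,
  };
}
```

In a Node.js deploy script this would typically be called once at startup with process.env, so that a bad value stops the run before any switch happens.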
Closes #264