feat(vmcp): add auth-retry with circuit breaker to BackendClient#4169
feat(vmcp): add auth-retry with circuit breaker to BackendClient#4169
Conversation
There was a problem hiding this comment.
Large PR Detected
This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.
How to unblock this PR:
Add a section to your PR description with the following format:
## Large PR Justification
[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformationAlternative:
Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.
See our Contributing Guidelines for more details.
This review will be automatically dismissed once you add the justification section.
There was a problem hiding this comment.
Pull request overview
Adds an auth-failure retry decorator (with exponential backoff + per-backend circuit breaker + tracing) around the vMCP BackendClient, expands auth error string detection, and introduces new unit/integration/E2E test coverage plus test utilities to simulate transient HTTP failures.
Changes:
- Wrap
NewHTTPBackendClientwithretryingBackendClientimplementing auth retry/backoff + circuit breaker + OTel spans. - Refactor
IsAuthenticationErrorinto a sharedauthErrorPatternslist (including mcp-go"unauthorized (401)"). - Add test helpers and new unit/integration/E2E tests for auth-retry and circuit breaker behavior.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/vmcp/client/auth_retry.go | Implements auth-retry decorator w/ backoff, circuit breaker, singleflight, and OTel spans |
| pkg/vmcp/client/client.go | Wraps the existing HTTP backend client with the retry decorator |
| pkg/vmcp/errors.go | Refactors authentication error detection into pattern list and expands coverage |
| pkg/vmcp/client/auth_retry_test.go | Unit tests for retry/circuit breaker behavior |
| pkg/vmcp/client/auth_retry_integration_test.go | Integration tests using a real mcp-go HTTP server with transient 401 responses |
| pkg/vmcp/client/client_test.go | Updates tests to account for the new retry decorator return type |
| test/integration/vmcp/helpers/backend.go | Adds WithHTTPMiddleware to inject transient HTTP failures in integration tests |
| test/e2e/thv-operator/virtualmcp/helpers.go | Extends E2E backend deployment helper (Args + PodTemplateSpec) and adds health status const |
| test/e2e/thv-operator/virtualmcp/virtualmcp_auth_retry_test.go | New E2E test validating retry exhaustion behavior against a persistent-401 backend |
| test/e2e/thv-operator/virtualmcp/virtualmcp_circuit_breaker_test.go | Minor refactor to use shared health-status constant |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5ab7f12704
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Large PR justification has been provided. Thank you!
|
✅ Large PR justification has been provided. The size review has been dismissed and this PR can now proceed with normal review. |
There was a problem hiding this comment.
Pull request overview
Adds authentication-failure retry behavior (with backoff + circuit breaker + tracing) to the vMCP BackendClient, expands auth-error detection patterns, and introduces new test hooks/coverage to validate the behavior end-to-end.
Changes:
- Wrap
NewHTTPBackendClientwith aretryingBackendClientdecorator that retriesErrAuthenticationFailedwith exponential backoff and a per-backend circuit breaker; instruments retries with OTel spans. - Refactor
IsAuthenticationErrorto use a pattern list, including the mcp-go"unauthorized (401)"error format. - Extend integration/E2E test helpers and add new unit/integration/E2E tests for auth-retry and circuit-breaker behavior.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
pkg/vmcp/client/auth_retry.go |
Implements auth retry loop with backoff, circuit breaker, singleflight coalescing, and OTel spans. |
pkg/vmcp/client/client.go |
Wraps the HTTP backend client with the retry decorator; routes errors through wrapBackendError. |
pkg/vmcp/errors.go |
Refactors auth error string matching into authErrorPatterns and adds additional patterns. |
pkg/vmcp/client/auth_retry_test.go |
Unit tests for retry behavior, breaker open/reset, concurrency coalescing, and cancellation. |
pkg/vmcp/client/auth_retry_integration_test.go |
Integration test validating transient-401 retry against a real mcp-go streamable HTTP server. |
pkg/vmcp/client/client_test.go |
Updates tests to account for the new retry-wrapper returned by NewHTTPBackendClient. |
test/integration/vmcp/helpers/backend.go |
Adds WithHTTPMiddleware for injecting transient HTTP behavior in integration tests. |
test/e2e/thv-operator/virtualmcp/helpers.go |
Extends backend config to support args/pod template patches and skipping readiness wait; adds a shared health-status constant. |
test/e2e/thv-operator/virtualmcp/virtualmcp_auth_retry_test.go |
New E2E coverage validating persistent 401 leads to “unavailable” backend status. |
test/e2e/thv-operator/virtualmcp/virtualmcp_circuit_breaker_test.go |
Replaces hard-coded "healthy" with shared constant. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #4169 +/- ##
==========================================
- Coverage 69.30% 69.20% -0.10%
==========================================
Files 466 467 +1
Lines 46774 46920 +146
==========================================
+ Hits 32416 32471 +55
+ Misses 11871 11861 -10
- Partials 2487 2588 +101 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
This PR adds an authentication-failure retry decorator (with exponential backoff, per-backend circuit breaker, singleflight coalescing, and OTel spans) around the vMCP BackendClient, and refactors error classification / test helpers to support the new behavior.
Changes:
- Wrap
NewHTTPBackendClientwith aretryingBackendClientand introducehttpStatusRoundTripperto map HTTP 401/403 and 5xx into vmcp sentinel errors forerrors.Ischecks. - Refactor
IsAuthenticationError/IsConnectionErrorto use package-level pattern slices and expand auth matching. - Add/adjust integration + E2E test helpers/tests to inject transient HTTP failures and validate auth-retry / circuit-breaker behavior.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
pkg/vmcp/client/client.go |
Wraps the backend client with retry logic; adds HTTP status → sentinel error mapping; routes more call errors through wrapBackendError. |
pkg/vmcp/client/auth_retry.go |
Implements auth-retry with exponential backoff, per-backend circuit breaker, singleflight coalescing, and OTel span instrumentation. |
pkg/vmcp/errors.go |
Refactors string-based auth/connection error detection to pattern slices and expands auth patterns. |
pkg/vmcp/client/client_test.go |
Updates tests to account for NewHTTPBackendClient now returning a retrying decorator. |
pkg/vmcp/client/auth_retry_test.go |
Adds unit tests covering retry success/exhaustion, breaker open/reset, concurrency coalescing, and context cancellation. |
pkg/vmcp/client/auth_retry_integration_test.go |
Adds integration test exercising retry behavior against an httptest-backed MCP server returning a transient 401. |
test/integration/vmcp/helpers/backend.go |
Adds WithHTTPMiddleware to wrap backend HTTP handlers for injecting transient HTTP failures in integration tests. |
test/e2e/thv-operator/virtualmcp/helpers.go |
Adds helpers/fields to support new E2E scenarios (pod-only readiness wait, backend args/pod template patching, skip readiness gate, health status constant). |
test/e2e/thv-operator/virtualmcp/virtualmcp_circuit_breaker_test.go |
Replaces literal "healthy" comparisons with shared constant. |
test/e2e/thv-operator/virtualmcp/virtualmcp_auth_retry_test.go |
Adds E2E test validating persistent-401 backend transitions to unavailable/degraded while stable backend remains healthy. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
`NewHTTPBackendClient` now wraps the raw HTTP client in a `retryingBackendClient` decorator that: - Intercepts `ErrAuthenticationFailed` (401/403) returned by any `BackendClient` method and retries up to `maxAuthRetries` (3) times with exponential backoff (100 ms base, doubling each attempt). - Maintains a per-backend `authCircuitBreaker` that opens after `authCircuitBreakerThreshold` (5) consecutive fully-exhausted retry sequences, preventing runaway latency from permanently broken credentials. - Uses `singleflight` to coalesce concurrent backoff waits for the same backend, so N goroutines racing on a 401 sleep only once per attempt. - Wraps each retry sequence in an OpenTelemetry `auth.retry` span so the overhead is visible in distributed traces. - Never logs raw credentials. `IsAuthenticationError` is refactored from chained `if` blocks into a package-level `authErrorPatterns` slice (fixes gocyclo limit and adds the mcp-go `"unauthorized (401)"` format that was previously missed). The integration test helper gains `WithHTTPMiddleware` so tests can inject transient HTTP errors without modifying the MCP server logic. New tests: - 9 unit tests (`auth_retry_test.go`) covering success, single-retry, max-retries, non-auth passthrough, circuit breaker open/reset, and context cancellation. - 1 integration test (`auth_retry_integration_test.go`) against a real mcp-go streamable-HTTP server. - 1 E2E Ginkgo suite (`virtualmcp_auth_retry_test.go`) deploying a Python 401 backend alongside a healthy yardstick backend and asserting that the failing backend is marked `BackendStatusUnavailable` while the stable backend remains ready. Closes: #3869
There was a problem hiding this comment.
Pull request overview
Adds an auth-failure retry decorator around the vMCP BackendClient, converting HTTP status codes into sentinel errors to enable errors.Is()-based detection and supporting end-to-end verification of auth-retry exhaustion behavior.
Changes:
- Wrap
httpBackendClientwith a newretryingBackendClientthat retriesErrAuthenticationFailedwith exponential backoff and a per-backend circuit breaker; coalesces concurrent backoff sleeps viasingleflightand instruments retries with anauth.retryspan. - Introduce an HTTP
RoundTripperthat turns 401/403 and selected 5xx responses into structured sentinel errors and updates backend error wrapping to prefer these sentinels. - Expand integration/E2E test helpers and add unit + integration + E2E tests to exercise transient/persistent 401 behavior and health reporting.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
pkg/vmcp/client/client.go |
Wraps the HTTP backend client with retrying decorator; adds httpStatusRoundTripper; updates error wrapping call sites to use shared wrapBackendError. |
pkg/vmcp/client/auth_retry.go |
New retry decorator implementing auth retry/backoff, circuit breaker, singleflight coalescing, and OTel span instrumentation. |
pkg/vmcp/errors.go |
Refactors auth/connection string matchers into pattern slices and expands auth patterns (incl. unauthorized (401)). |
pkg/vmcp/client/client_test.go |
Updates tests to account for NewHTTPBackendClient now returning the retrying decorator. |
pkg/vmcp/client/auth_retry_test.go |
New unit tests for retry success, max-retry exhaustion, circuit breaker behavior, and concurrency coalescing. |
pkg/vmcp/client/auth_retry_integration_test.go |
New integration test validating transient 401 retry succeeds end-to-end with a real mcp-go server. |
test/integration/vmcp/helpers/backend.go |
Adds WithHTTPMiddleware to allow injecting transient HTTP failures in integration tests. |
test/e2e/thv-operator/virtualmcp/helpers.go |
Adds pod-only readiness wait helper; extends backend config to support args, pod template patching, and skipping readiness waits; adds backendHealthStatusHealthy constant. |
test/e2e/thv-operator/virtualmcp/virtualmcp_circuit_breaker_test.go |
Uses the new backendHealthStatusHealthy constant instead of hardcoded "healthy". |
test/e2e/thv-operator/virtualmcp/virtualmcp_auth_retry_test.go |
New E2E test deploying a persistent-401 backend and asserting it becomes unavailable while a stable backend remains healthy. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| func NewHTTPBackendClient(registry vmcpauth.OutgoingAuthRegistry) (vmcp.BackendClient, error) { | ||
| if registry == nil { | ||
| return nil, fmt.Errorf("registry cannot be nil; use UnauthenticatedStrategy for no authentication") | ||
| } | ||
|
|
||
| c := &httpBackendClient{ | ||
| registry: registry, | ||
| } | ||
| c.clientFactory = c.defaultClientFactory | ||
| return c, nil | ||
| return newRetryingBackendClient(c, registry), nil | ||
| } |
| type retryingBackendClient struct { | ||
| inner vmcp.BackendClient | ||
| registry vmcpauth.OutgoingAuthRegistry | ||
|
|
jerm-dro
left a comment
There was a problem hiding this comment.
FYI: I think this potentially overlaps with the embedded auth in vMCP work: https://github.com/stacklok/stacklok-epics/issues/251
As part of that work, we want to solve the token refresh problem. It may make sense for those changes to land first.
| // breakers maps backendID -> *authCircuitBreaker. LoadOrStore is used for concurrent safety. | ||
| breakers sync.Map |
There was a problem hiding this comment.
blocker: can we implement this without a sync.Map?
The multi-session already has the notion of separate clients per backend. We don't actually have to share state, right? If so, then we should have one retrier per backend.
Summary
NewHTTPBackendClientnow wraps the raw HTTP client in aretryingBackendClientdecorator that:ErrAuthenticationFailed(401/403) returned by anyBackendClientmethod and retries up tomaxAuthRetries(3) times with exponential backoff (100 ms base, doubling each attempt).authCircuitBreakerthat opens afterauthCircuitBreakerThreshold(5) consecutive fully-exhausted retry sequences, preventing runaway latency from permanently broken credentials.singleflightto coalesce concurrent backoff waits for the same backend, so N goroutines racing on a 401 sleep only once per attempt.auth.retryspan so the overhead is visible in distributed traces.IsAuthenticationErroris refactored from chainedifblocks into a package-levelauthErrorPatternsslice (fixes gocyclo limit and adds the mcp-go"unauthorized (401)"format that was previously missed).The integration test helper gains
WithHTTPMiddlewareso tests can inject transient HTTP errors without modifying the MCP server logic.Fixes #3869
Type of change
Test plan
task test)task test-e2e)task lint-fix)Changes
Does this introduce a user-facing change?
Special notes for reviewers
Large PR Justification
This is a complete and isolated functionality, with comprehensive tests. Cannot be split.