Skip to content

feat(vmcp): add auth-retry with circuit breaker to BackendClient#4169

Open
yrobla wants to merge 1 commit intomainfrom
issue-3869
Open

feat(vmcp): add auth-retry with circuit breaker to BackendClient#4169
yrobla wants to merge 1 commit intomainfrom
issue-3869

Conversation

@yrobla
Copy link
Contributor

@yrobla yrobla commented Mar 16, 2026

Summary

NewHTTPBackendClient now wraps the raw HTTP client in a retryingBackendClient decorator that:

  • Intercepts ErrAuthenticationFailed (401/403) returned by any BackendClient method and retries up to maxAuthRetries (3) times with exponential backoff (100 ms base, doubling each attempt).
  • Maintains a per-backend authCircuitBreaker that opens after authCircuitBreakerThreshold (5) consecutive fully-exhausted retry sequences, preventing runaway latency from permanently broken credentials.
  • Uses singleflight to coalesce concurrent backoff waits for the same backend, so N goroutines racing on a 401 sleep only once per attempt.
  • Wraps each retry sequence in an OpenTelemetry auth.retry span so the overhead is visible in distributed traces.
  • Never logs raw credentials.

IsAuthenticationError is refactored from chained if blocks into a package-level authErrorPatterns slice (fixes gocyclo limit and adds the mcp-go "unauthorized (401)" format that was previously missed).

The integration test helper gains WithHTTPMiddleware so tests can inject transient HTTP errors without modifying the MCP server logic.

Fixes #3869

Type of change

  • Bug fix
  • New feature
  • Refactoring (no behavior change)
  • Dependency update
  • Documentation
  • Other (describe):

Test plan

  • Unit tests (task test)
  • E2E tests (task test-e2e)
  • Linting (task lint-fix)
  • Manual testing (describe below)

Changes

File Change

Does this introduce a user-facing change?

Special notes for reviewers

Large PR Justification

This is a complete and isolated functionality, with comprehensive tests. Cannot be split.

@yrobla yrobla requested review from Copilot and removed request for ChrisJBurns, JAORMX, amirejaz, jerm-dro and jhrozek March 16, 2026 15:58
Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Large PR Detected

This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.

How to unblock this PR:

Add a section to your PR description with the following format:

## Large PR Justification

[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformation

Alternative:

Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.

See our Contributing Guidelines for more details.


This review will be automatically dismissed once you add the justification section.

@github-actions github-actions bot added the size/XL Extra large PR: 1000+ lines changed label Mar 16, 2026
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an auth-failure retry decorator (with exponential backoff + per-backend circuit breaker + tracing) around the vMCP BackendClient, expands auth error string detection, and introduces new unit/integration/E2E test coverage plus test utilities to simulate transient HTTP failures.

Changes:

  • Wrap NewHTTPBackendClient with retryingBackendClient implementing auth retry/backoff + circuit breaker + OTel spans.
  • Refactor IsAuthenticationError into a shared authErrorPatterns list (including mcp-go "unauthorized (401)").
  • Add test helpers and new unit/integration/E2E tests for auth-retry and circuit breaker behavior.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
pkg/vmcp/client/auth_retry.go Implements auth-retry decorator w/ backoff, circuit breaker, singleflight, and OTel spans
pkg/vmcp/client/client.go Wraps the existing HTTP backend client with the retry decorator
pkg/vmcp/errors.go Refactors authentication error detection into pattern list and expands coverage
pkg/vmcp/client/auth_retry_test.go Unit tests for retry/circuit breaker behavior
pkg/vmcp/client/auth_retry_integration_test.go Integration tests using a real mcp-go HTTP server with transient 401 responses
pkg/vmcp/client/client_test.go Updates tests to account for the new retry decorator return type
test/integration/vmcp/helpers/backend.go Adds WithHTTPMiddleware to inject transient HTTP failures in integration tests
test/e2e/thv-operator/virtualmcp/helpers.go Extends E2E backend deployment helper (Args + PodTemplateSpec) and adds health status const
test/e2e/thv-operator/virtualmcp/virtualmcp_auth_retry_test.go New E2E test validating retry exhaustion behavior against a persistent-401 backend
test/e2e/thv-operator/virtualmcp/virtualmcp_circuit_breaker_test.go Minor refactor to use shared health-status constant

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5ab7f12704

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Mar 16, 2026
@github-actions github-actions bot dismissed their stale review March 16, 2026 16:17

Large PR justification has been provided. Thank you!

@github-actions
Copy link
Contributor

✅ Large PR justification has been provided. The size review has been dismissed and this PR can now proceed with normal review.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds authentication-failure retry behavior (with backoff + circuit breaker + tracing) to the vMCP BackendClient, expands auth-error detection patterns, and introduces new test hooks/coverage to validate the behavior end-to-end.

Changes:

  • Wrap NewHTTPBackendClient with a retryingBackendClient decorator that retries ErrAuthenticationFailed with exponential backoff and a per-backend circuit breaker; instruments retries with OTel spans.
  • Refactor IsAuthenticationError to use a pattern list, including the mcp-go "unauthorized (401)" error format.
  • Extend integration/E2E test helpers and add new unit/integration/E2E tests for auth-retry and circuit-breaker behavior.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
pkg/vmcp/client/auth_retry.go Implements auth retry loop with backoff, circuit breaker, singleflight coalescing, and OTel spans.
pkg/vmcp/client/client.go Wraps the HTTP backend client with the retry decorator; routes errors through wrapBackendError.
pkg/vmcp/errors.go Refactors auth error string matching into authErrorPatterns and adds additional patterns.
pkg/vmcp/client/auth_retry_test.go Unit tests for retry behavior, breaker open/reset, concurrency coalescing, and cancellation.
pkg/vmcp/client/auth_retry_integration_test.go Integration test validating transient-401 retry against a real mcp-go streamable HTTP server.
pkg/vmcp/client/client_test.go Updates tests to account for the new retry-wrapper returned by NewHTTPBackendClient.
test/integration/vmcp/helpers/backend.go Adds WithHTTPMiddleware for injecting transient HTTP behavior in integration tests.
test/e2e/thv-operator/virtualmcp/helpers.go Extends backend config to support args/pod template patches and skipping readiness wait; adds a shared health-status constant.
test/e2e/thv-operator/virtualmcp/virtualmcp_auth_retry_test.go New E2E coverage validating persistent 401 leads to “unavailable” backend status.
test/e2e/thv-operator/virtualmcp/virtualmcp_circuit_breaker_test.go Replaces hard-coded "healthy" with shared constant.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Mar 16, 2026
@codecov
Copy link

codecov bot commented Mar 16, 2026

Codecov Report

❌ Patch coverage is 82.35294% with 27 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.20%. Comparing base (ca7f127) to head (b53818c).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
pkg/vmcp/client/client.go 43.33% 11 Missing and 6 partials ⚠️
pkg/vmcp/client/auth_retry.go 95.49% 4 Missing and 1 partial ⚠️
test/integration/vmcp/helpers/backend.go 28.57% 4 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4169      +/-   ##
==========================================
- Coverage   69.30%   69.20%   -0.10%     
==========================================
  Files         466      467       +1     
  Lines       46774    46920     +146     
==========================================
+ Hits        32416    32471      +55     
+ Misses      11871    11861      -10     
- Partials     2487     2588     +101     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Mar 17, 2026
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds an authentication-failure retry decorator (with exponential backoff, per-backend circuit breaker, singleflight coalescing, and OTel spans) around the vMCP BackendClient, and refactors error classification / test helpers to support the new behavior.

Changes:

  • Wrap NewHTTPBackendClient with a retryingBackendClient and introduce httpStatusRoundTripper to map HTTP 401/403 and 5xx into vmcp sentinel errors for errors.Is checks.
  • Refactor IsAuthenticationError/IsConnectionError to use package-level pattern slices and expand auth matching.
  • Add/adjust integration + E2E test helpers/tests to inject transient HTTP failures and validate auth-retry / circuit-breaker behavior.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
pkg/vmcp/client/client.go Wraps the backend client with retry logic; adds HTTP status → sentinel error mapping; routes more call errors through wrapBackendError.
pkg/vmcp/client/auth_retry.go Implements auth-retry with exponential backoff, per-backend circuit breaker, singleflight coalescing, and OTel span instrumentation.
pkg/vmcp/errors.go Refactors string-based auth/connection error detection to pattern slices and expands auth patterns.
pkg/vmcp/client/client_test.go Updates tests to account for NewHTTPBackendClient now returning a retrying decorator.
pkg/vmcp/client/auth_retry_test.go Adds unit tests covering retry success/exhaustion, breaker open/reset, concurrency coalescing, and context cancellation.
pkg/vmcp/client/auth_retry_integration_test.go Adds integration test exercising retry behavior against an httptest-backed MCP server returning a transient 401.
test/integration/vmcp/helpers/backend.go Adds WithHTTPMiddleware to wrap backend HTTP handlers for injecting transient HTTP failures in integration tests.
test/e2e/thv-operator/virtualmcp/helpers.go Adds helpers/fields to support new E2E scenarios (pod-only readiness wait, backend args/pod template patching, skip readiness gate, health status constant).
test/e2e/thv-operator/virtualmcp/virtualmcp_circuit_breaker_test.go Replaces literal "healthy" comparisons with shared constant.
test/e2e/thv-operator/virtualmcp/virtualmcp_auth_retry_test.go Adds E2E test validating persistent-401 backend transitions to unavailable/degraded while stable backend remains healthy.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

`NewHTTPBackendClient` now wraps the raw HTTP client in a
`retryingBackendClient` decorator that:

- Intercepts `ErrAuthenticationFailed` (401/403) returned by any
  `BackendClient` method and retries up to `maxAuthRetries` (3) times
  with exponential backoff (100 ms base, doubling each attempt).
- Maintains a per-backend `authCircuitBreaker` that opens after
  `authCircuitBreakerThreshold` (5) consecutive fully-exhausted retry
  sequences, preventing runaway latency from permanently broken
  credentials.
- Uses `singleflight` to coalesce concurrent backoff waits for the same
  backend, so N goroutines racing on a 401 sleep only once per attempt.
- Wraps each retry sequence in an OpenTelemetry `auth.retry` span so
  the overhead is visible in distributed traces.
- Never logs raw credentials.

`IsAuthenticationError` is refactored from chained `if` blocks into a
package-level `authErrorPatterns` slice (fixes gocyclo limit and adds
the mcp-go `"unauthorized (401)"` format that was previously missed).

The integration test helper gains `WithHTTPMiddleware` so tests can
inject transient HTTP errors without modifying the MCP server logic.

New tests:
- 9 unit tests (`auth_retry_test.go`) covering success, single-retry,
  max-retries, non-auth passthrough, circuit breaker open/reset, and
  context cancellation.
- 1 integration test (`auth_retry_integration_test.go`) against a real
  mcp-go streamable-HTTP server.
- 1 E2E Ginkgo suite (`virtualmcp_auth_retry_test.go`) deploying a
  Python 401 backend alongside a healthy yardstick backend and asserting
  that the failing backend is marked `BackendStatusUnavailable` while
  the stable backend remains ready.

Closes: #3869
@yrobla yrobla requested a review from Copilot March 17, 2026 12:33
@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Mar 17, 2026
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an auth-failure retry decorator around the vMCP BackendClient, converting HTTP status codes into sentinel errors to enable errors.Is()-based detection and supporting end-to-end verification of auth-retry exhaustion behavior.

Changes:

  • Wrap httpBackendClient with a new retryingBackendClient that retries ErrAuthenticationFailed with exponential backoff and a per-backend circuit breaker; coalesces concurrent backoff sleeps via singleflight and instruments retries with an auth.retry span.
  • Introduce an HTTP RoundTripper that turns 401/403 and selected 5xx responses into structured sentinel errors and updates backend error wrapping to prefer these sentinels.
  • Expand integration/E2E test helpers and add unit + integration + E2E tests to exercise transient/persistent 401 behavior and health reporting.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
pkg/vmcp/client/client.go Wraps the HTTP backend client with retrying decorator; adds httpStatusRoundTripper; updates error wrapping call sites to use shared wrapBackendError.
pkg/vmcp/client/auth_retry.go New retry decorator implementing auth retry/backoff, circuit breaker, singleflight coalescing, and OTel span instrumentation.
pkg/vmcp/errors.go Refactors auth/connection string matchers into pattern slices and expands auth patterns (incl. unauthorized (401)).
pkg/vmcp/client/client_test.go Updates tests to account for NewHTTPBackendClient now returning the retrying decorator.
pkg/vmcp/client/auth_retry_test.go New unit tests for retry success, max-retry exhaustion, circuit breaker behavior, and concurrency coalescing.
pkg/vmcp/client/auth_retry_integration_test.go New integration test validating transient 401 retry succeeds end-to-end with a real mcp-go server.
test/integration/vmcp/helpers/backend.go Adds WithHTTPMiddleware to allow injecting transient HTTP failures in integration tests.
test/e2e/thv-operator/virtualmcp/helpers.go Adds pod-only readiness wait helper; extends backend config to support args, pod template patching, and skipping readiness waits; adds backendHealthStatusHealthy constant.
test/e2e/thv-operator/virtualmcp/virtualmcp_circuit_breaker_test.go Uses the new backendHealthStatusHealthy constant instead of hardcoded "healthy".
test/e2e/thv-operator/virtualmcp/virtualmcp_auth_retry_test.go New E2E test deploying a persistent-401 backend and asserting it becomes unavailable while a stable backend remains healthy.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 73 to +83
func NewHTTPBackendClient(registry vmcpauth.OutgoingAuthRegistry) (vmcp.BackendClient, error) {
if registry == nil {
return nil, fmt.Errorf("registry cannot be nil; use UnauthenticatedStrategy for no authentication")
}

c := &httpBackendClient{
registry: registry,
}
c.clientFactory = c.defaultClientFactory
return c, nil
return newRetryingBackendClient(c, registry), nil
}
Comment on lines +83 to +86
type retryingBackendClient struct {
inner vmcp.BackendClient
registry vmcpauth.OutgoingAuthRegistry

Copy link
Contributor

@jerm-dro jerm-dro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI: I think this potentially overlaps with the embedded auth in vMCP work: https://github.com/stacklok/stacklok-epics/issues/251

As part of that work, we want to solve the token refresh problem. It may make sense for those changes to land first.

Comment on lines +90 to +91
// breakers maps backendID -> *authCircuitBreaker. LoadOrStore is used for concurrent safety.
breakers sync.Map
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocker: can we implement this without a sync.Map?

The multi-session already has the notion of separate clients per backend. We don't actually have to share state, right? If so, then we should have one retrier per backend.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/XL Extra large PR: 1000+ lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[vMCP] Handle auth credential expiration with retry and proactive refresh

4 participants