feat(workflow-executor): use stored OAuth credentials for MCP steps [PRD-367]#1665
Conversation
On the invalid_grant concurrent-rotation retry path the write-back used the pre-retry credential for the non-token fields; thread the credential whose token produced the grant through so a concurrent re-deposit is not partially reverted. Addresses review on #1665. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A 403 is a permission/scope failure that a token refresh or re-consent cannot resolve, so it no longer triggers the refresh + re-auth flow (which looped) and instead surfaces as an ordinary failure. The spec specifies retry on 401. Addresses review on #1665. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
On the invalid_grant concurrent-rotation retry path the write-back used the pre-retry credential for the non-token fields; thread the credential whose token produced the grant through so a concurrent re-deposit is not partially reverted. Addresses review on #1665. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A 403 is a permission/scope failure that a token refresh or re-consent cannot resolve, so it no longer triggers the refresh + re-auth flow (which looped) and instead surfaces as an ordinary failure. The spec specifies retry on 401. Addresses review on #1665. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
8d441d2 to
0e05efe
Compare
12 new issues
|
On the invalid_grant concurrent-rotation retry path the write-back used the pre-retry credential for the non-token fields; thread the credential whose token produced the grant through so a concurrent re-deposit is not partially reverted. Addresses review on #1665. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A 403 is a permission/scope failure that a token refresh or re-consent cannot resolve, so it no longer triggers the refresh + re-auth flow (which looped) and instead surfaces as an ordinary failure. The spec specifies retry on 401. Addresses review on #1665. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
d4555ee to
cf71d69
Compare
On the invalid_grant concurrent-rotation retry path the write-back used the pre-retry credential for the non-token fields; thread the credential whose token produced the grant through so a concurrent re-deposit is not partially reverted. Addresses review on #1665. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A 403 is a permission/scope failure that a token refresh or re-consent cannot resolve, so it no longer triggers the refresh + re-auth flow (which looped) and instead surfaces as an ordinary failure. The spec specifies retry on 401. Addresses review on #1665. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
cf71d69 to
55309c7
Compare
d42828e to
30b063c
Compare
On the invalid_grant concurrent-rotation retry path the write-back used the pre-retry credential for the non-token fields; thread the credential whose token produced the grant through so a concurrent re-deposit is not partially reverted. Addresses review on #1665. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A 403 is a permission/scope failure that a token refresh or re-consent cannot resolve, so it no longer triggers the refresh + re-auth flow (which looped) and instead surfaces as an ordinary failure. The spec specifies retry on 401. Addresses review on #1665. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
55309c7 to
842d468
Compare
|
Agent code review — Claude Opus 4.8 (2026-06-18) Reviewed the folded-in commits Verified:
|
e121a1a to
270dbcc
Compare
On the invalid_grant concurrent-rotation retry path the write-back used the pre-retry credential for the non-token fields; thread the credential whose token produced the grant through so a concurrent re-deposit is not partially reverted. Addresses review on #1665. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A 403 is a permission/scope failure that a token refresh or re-consent cannot resolve, so it no longer triggers the refresh + re-auth flow (which looped) and instead surfaces as an ordinary failure. The spec specifies retry on 401. Addresses review on #1665. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
11aba95 to
1f58924
Compare
270dbcc to
2974d29
Compare
On the invalid_grant concurrent-rotation retry path the write-back used the pre-retry credential for the non-token fields; thread the credential whose token produced the grant through so a concurrent re-deposit is not partially reverted. Addresses review on #1665. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A 403 is a permission/scope failure that a token refresh or re-consent cannot resolve, so it no longer triggers the refresh + re-auth flow (which looped) and instead surfaces as an ordinary failure. The spec specifies retry on 401. Addresses review on #1665. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…cutor [PRD-367] PR1 wired an in-memory credential store + deposit endpoint into buildInMemoryExecutor, so the previous "in-memory raises ConfigurationError for oauth2 steps" behavior was inconsistent: a credential could be deposited but never used. Wire an OAuthTokenService into the in-memory runner (sharing the same store instance the deposit endpoint writes to) so oauth2 steps work end-to-end in dev, matching the database executor. The token service is now a required RunnerConfig/RemoteToolFetcher collaborator (both executors provide it), so the unreachable ConfigurationError guard and its fetcher test are removed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tials port [PRD-367] The PR1 rebase moved the credentials store interface + types from stores/mcp-oauth-credentials-store to ports/mcp-oauth-credentials-store (the store file now holds only the Database/InMemory implementations). Import McpOAuthCredentialsStore and StoredMcpOAuthCredential from the new port path so the package compiles against the rebased PR1 base. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tion [PRD-367] PR1 dropped enc_key_version from the credential store (no version-aware decrypt path), so the rotation write-back no longer carries encKeyVersion. Per PRD-367 key-rotation handling, a decrypt failure with the encryption key PRESENT (auth-tag mismatch from a since-rotated/hard-swapped key) is recoverable: toGrantParams now classifies it as OAuthReauthRequiredError (needs-oauth-reauth) so re-consent re-deposits under the new key. A missing key (ExecutorEncryptionKeyMissingError) stays terminal — re-consent cannot help and a re-deposit would 503. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ting [PRD-367] New GET /list-mcp-tools?mcpServerId= route on the executor HTTP server for the orchestrator-engine MCP-server details page: resolves the caller's vault credential (user_id from the validated JWT, never the request), refreshes, injects the Bearer, and returns the server's tool definitions — reusing RemoteToolFetcher, no new fetch/refresh logic. A missing/unrefreshable credential returns a typed needs-oauth-reauth (409), not a generic error or empty list. Wired into both the database and in-memory executors so oauth2 tool listing works in dev too. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…nses [PRD-367]
A literal JSON null (or other non-object) body from the token endpoint overwrote the {} parse default, so the subsequent payload.error / payload.access_token reads threw a TypeError instead of the typed OAuthRefreshError. Keep the {} default for non-object bodies so the status checks still surface a typed OAuthRefreshError.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…nnect [PRD-367] On DELETE /mcp-oauth-credentials/:mcpServerId, evict the in-process cached access token for (user, server) so a disconnect takes effect immediately — otherwise the executor keeps serving the cached token until it expires even though the credential row is gone (surfaced by end-to-end testing). Wires OAuthTokenService into both executor builders + the HTTP server (optional on the options so credential-free tests need not construct one), and adds OAuthTokenService.evict with coverage. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
63123b6 to
b6f316e
Compare
- list-mcp-tools: map ExecutorEncryptionKeyMissingError to the typed 503
{ code } the deposit endpoint already returns, so the details page shows
the admin message instead of a generic error.
- token-service: encrypt the rotated refresh token inside the best-effort
write-back try, so an encrypt failure can't fail an otherwise-valid
getAccessToken.
- ai-proxy isMcpAuthError: an explicit status is authoritative — a 403 is
not refreshable even when its message says "unauthorized"; fall back to
the message only when no status is present.
- Trim comments to why-not-how.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The re-read recovery path only handles a peer instance's rotation, not our own failed write-back (the stored token is unchanged, so re-read sees no rotation and forces re-auth). Describe the actual fallback: re-authentication. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…RD-624] The token endpoint is where the executor POSTs the refresh grant (with client credentials), but it was stored as an unconstrained string, so an authenticated caller could aim the executor at an internal address (SSRF, incl. cloud metadata). Add assertSafeTokenEndpoint, enforced at deposit (400) and again before the refresh POST (defence in depth for pre-existing rows): - scheme must be http(s); https required in production (http stays allowed off-prod for the local OAuth sim); - link-local / cloud-metadata (169.254.0.0/16, fe80::/10) blocked everywhere; - loopback blocked in production; - RFC1918 private ranges stay allowed — private OAuth providers are supported. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…d [PRD-624] The URL parser normalizes `::ffff:127.0.0.1` to the hex form `::ffff:7f00:1`, which the IPv6 branch classified as neither loopback nor link-local — so a mapped loopback/metadata address slipped past the SSRF block in production. Extract the embedded IPv4 from `::ffff:` addresses and run it through the IPv4 loopback/link-local checks. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…omment [PRD-624] Match the codebase convention (run-to-available-step-mapper.ts). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Drop the local-test-tool and ticket references; explain the why (SSRF rationale, the deliberate RFC1918 allowance, why no DNS resolution) rather than enumerating the rules. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…nt guard [PRD-624] 0.0.0.0 (and ::) connects to loopback on Linux, so it was an SSRF bypass the guard accepted. Classify the unspecified range (0.0.0.0/8, ::) and reject it on every environment — it is never a valid token endpoint. Surfaced by evaluating ipaddr.js, which flags it as `unspecified`; the WHATWG URL parser already normalizes the decimal/hex/short-form IP encodings to dotted form, so those were already covered. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…oken guard [PRD-624] Only the bare `localhost` was treated as loopback, so `localhost.` and the `.localhost` TLD (RFC 6761) — both resolving to loopback — slipped past the guard in production. Match any host whose last label is `localhost` (with or without a trailing dot), without over-blocking real domains like auth.localhost.example.com. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…t guard [PRD-624] Regression test proving `127.0.0.1.` / `169.254.169.254.` are rejected: the guard runs `new URL()` first, which normalizes the trailing dot before classification, so the trailing-dot form is not a bypass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Scra3
left a comment
There was a problem hiding this comment.
Review of the OAuth2 MCP executor runtime (PRD-367 PR2). One blocking security issue (SSRF via redirect on the token endpoint) plus a few non-blocking hardening points. Note: the tool-call re-auth idempotency-phase issue is already fixed in the stacked #1724, so it's not raised here.
| let response: Awaited<ReturnType<typeof fetch>>; | ||
|
|
||
| try { | ||
| response = await fetch(params.tokenEndpoint, { method: 'POST', headers, body }); |
There was a problem hiding this comment.
issue (blocking): fetch() follows redirects by default, so a deposited token endpoint that passes assertSafeTokenEndpoint but 3xx-redirects to 169.254.169.254/loopback re-sends the POST body to the internal host — leaking the refresh_token (and the client_secret in client_secret_post mode) and parsing the internal response as a token; pass redirect: 'manual' and treat any 3xx as an OAuthRefreshError.
| await store.delete(userId, ctx.params.mcpServerId); | ||
| // Evict any in-process cached access token so the disconnect is immediate, not deferred until | ||
| // the cached token expires (the row is gone, but the runtime would otherwise still serve it). | ||
| this.options.oauthTokenService?.evict(userId, ctx.params.mcpServerId); |
There was a problem hiding this comment.
issue (non-blocking): the POST deposit handler has no symmetric evict(), so re-depositing a credential (reconnect without a prior DELETE) keeps serving the previously cached access token until it expires — mirror this evict on the deposit path.
| mcpServerId, | ||
| ); | ||
|
|
||
| this.cache.set(key, { |
There was a problem hiding this comment.
issue (non-blocking): evict() runs outside the KeyedMutex, so a disconnect landing during an in-flight refresh no-ops on the empty cache and this cache.set then re-populates a just-deleted credential (served until expiry); it is also process-local, so on a multi-instance deployment other instances never drop the token — consider an eviction epoch checked before cache.set.
| throw new InvalidTokenEndpointError('it points at a link-local / metadata address'); | ||
| } | ||
|
|
||
| if (isProduction()) { |
There was a problem hiding this comment.
suggestion: the loopback block and the https requirement are gated on NODE_ENV === 'production', so a real deployment with NODE_ENV unset or misspelled silently allows loopback targets and cleartext http — prefer failing closed and opting into the relaxation for dev/test explicitly.

What & why
Executor-side runtime use of stored OAuth credentials for OAuth2-protected MCP servers (PRD-367, PR2). At an
oauth2MCP step the executor looks up the stored credential by(user, server), runs a refresh-token grant against the storedtoken_endpointbehind an expiry-skew cache, injects the Bearer token before connecting, retries once on an upstream 401 (covering both list-tools and the tool call), and pauses the stepawaiting-inputwithawaitingInputReason: needs-oauth-reauthwhen there is no usable credential or the refresh is rejected. Bearer/none steps are unchanged.Stacked PR / deploy ordering
feat/prd-367-pr1-executor-oauth-credentials(PR feat(workflow-executor): add OAuth credential store + deposit endpoint (PRD-367 PR1) #1619), notmain: PR2 depends on PR1's credential store / encryption / deposit endpoint, which is not yet merged. Review the diff against that branch.authType+ accepts the typedawaitingInputReason(PR1.5) and the frontend ships (PR3) — deploy the orchestrator first. Safe to deploy alone: the oauth2 path only activates onceauthType=oauth2is served; bearer/none is unchanged.Notable choices (large-PR annotation)
invalid_grant, not a Postgres row lock — the executor and token endpoints can sit behind a client VPN, so no DB lock is held across the refresh HTTP call. Row lock is the documented hardening path if a strict reuse-detection provider plus real cross-process contention appears.reloadWithFreshAuth) plus a typedOAuthReauthRequiredError -> awaiting-inputmapping; all token/credential/HTTP logic stays behindRemoteToolFetcherand the new OAuth token service.@forestadmin/ai-proxychange (consumed by the agent too) is additive-only (newloadToolsWithFailures/loadRemoteToolsWithFailures+mcp-auth-errorexport) plus behaviour-preserving delegation;authTypeis stripped in theMcpClientconstructor likeid.ConfigurationErrorfor oauth2 steps (no credential store wired).Tests
294 tests across the touched suites (ai-proxy 59 + workflow-executor 235); build, lint, tsc clean; >=90% line/stmt coverage on changed files.
Known limitation
A re-auth at tool-call time leaves the step at
idempotencyPhase=executing, so it needs a manual retry after re-auth; the common list-tools-time re-auth (pre-executor) resumes cleanly.Part of PRD-367.
🤖 Generated with Claude Code
Note
Add stored OAuth credential support for MCP steps in workflow executor
OAuthTokenServiceto retrieve, cache, and refresh OAuth access tokens per (userId, mcpServerId), with per-key serialization via a newKeyedMutexand rotation-aware retry logic.refreshAccessTokenandassertSafeTokenEndpointto perform the OAuth2 refresh-token grant and validate token endpoints against SSRF risks before any network call.RemoteToolFetcher.fetchto inject Bearer tokens for OAuth2 MCP servers, detect auth failures, and retry once with a forced refresh before surfacingOAuthReauthRequiredError.McpStepExecutorandStepExecutorFactoryso auth failures during tool invocation produce anawaiting-inputoutcome with reasonneeds-oauth-reauthinstead of failing the step.GET /list-mcp-toolsHTTP endpoint for design-time tool listing, with typed 409/503 responses for reauth and missing encryption key cases.POST /mcp-oauth-credentialsnow validatestokenEndpointsafety at deposit time;DELETEnow evicts cached tokens immediately.RunnerandExecutorHttpServerconstructor signatures have changed to require OAuth-related dependencies.Macroscope summarized 81b5607.