Sec #6: Prevent capability shadowing#517
Conversation
arielho-docker
left a comment
There was a problem hiding this comment.
Correctness pass
Read both reload paths, the validator, the capability types, and all four call sites. The validator logic is sound and fail-closed — both validations run before any state mutation in reloadServerCapabilities, so a collision returns early with the server untouched, and callers log/return or surface to the user. Self-exclusion (registeredCapabilityNameIndexes(serverName)), policy-before-collision ordering, reserved-prompt gating on DynamicTools, deterministic error ordering, and cross-type namespace independence all check out.
Two findings inline:
- High: validate (in
reloadServerCapabilities) and register (inupdateServerCapabilities) run under two separate lock acquisitions, and the new prompt/resource/template index keys offserverCapabilities, which is only updated in the second — so concurrent reloads of distinct servers introducing the same new identifier can both pass and the later one silently shadows. Cheapest fix: re-validate insideupdateServerCapabilities. - Low: the validator trims identifiers for comparison while the SDK keys on the raw string (over-rejection, errs safe).
Also worth noting in the description: routing prompt-policy filtering through the dynamic-reload path is a real behavior change (that path previously filtered only tools), not just a refactor.
🤖 Reviewed with Claude Code
| if err := validateExternalToolNameCollisions(allowedServerCaps.Tools, existingToolRegistrations); err != nil { | ||
| return nil, err | ||
| } | ||
| if err := validateExternalCapabilityNameCollisions(allowedServerCaps, g.registeredCapabilityNameIndexes(serverName), g.DynamicTools); err != nil { |
There was a problem hiding this comment.
High — validation and registration are split across two lock acquisitions, so this check can still be raced.
This validation runs while reloadServerCapabilities holds capabilitiesMu (acquired at L403, released on return). But the actual AddPrompt/AddResource/AddResourceTemplate happens later in updateServerCapabilities, under a separate lock — see every caller: run.go:584→591, run.go:720→726, mcpadd.go:235→254, activateprofile.go:298→307. The lock is released in between, and updateServerCapabilities does not re-validate before registering.
Concrete race — two concurrent dynamic reloads of servers A and B both introducing prompt "summarize" (or the same resource URI / template):
- A validates against
serverCapabilities(no B) → passes; releases lock; A has not writtenserverCapabilitiesyet. - B validates against
serverCapabilities(no A — A only wroteserverAvailableCapabilities) → passes. - A's
updateServerCapabilitiesregisters"summarize", setsserverCapabilities[A]. - B's
updateServerCapabilitiesregisters"summarize"→ silently shadows A — exactly what this PR is meant to prevent.
Window is narrow (concurrent reloads of distinct servers introducing the identical new identifier — e.g. two list_changed firing together, or mcp-add racing a notification), and the validate/register split is inherited from #513. But see the note on registeredCapabilityNameIndexes below — the new checks key off state updated later than the tool check does, so they have a strictly wider window.
Cheapest fix: re-validate inside updateServerCapabilities under its lock, immediately before the Add* calls. (Alternatives: record a pending claim into serverCapabilities/a side index within this lock, or hold one contiguous lock across validate→register.)
There was a problem hiding this comment.
Fixed in 1cddacf8 by re-validating inside updateServerCapabilities while capabilitiesMu is held, immediately before removals/adds. That closes the split validate/register window for concurrent dynamic reloads because the second registrar now checks against any server that committed serverCapabilities after the earlier reloadServerCapabilities pass.
I also added TestUpdateServerCapabilitiesRevalidatesCapabilityCollisionsUnderLock for this path.
| ResourceTemplates: make(map[string]string), | ||
| } | ||
|
|
||
| for serverName, caps := range g.serverCapabilities { |
There was a problem hiding this comment.
This index is built from g.serverCapabilities, which is written only in updateServerCapabilities (L585) — after the validating lock in reloadServerCapabilities is released. That's what widens the race in the comment on L426: the tool collision check reads g.toolRegistrations, which reloadServerCapabilities updates inside its own lock (L434–446), so a concurrent tool reload sees pending tools. The prompt/resource/template tracking has no such in-lock update, so a concurrent reload validates against a serverCapabilities snapshot that is missing the other reload's pending capabilities.
If you adopt the re-validate-in-updateServerCapabilities fix this becomes moot; otherwise consider recording the claim here (under this lock) so a concurrent validation observes it.
There was a problem hiding this comment.
Addressed by the same updateServerCapabilities revalidation in 1cddacf8. The earlier registeredCapabilityNameIndexes call is still useful for fast fail-before-claim behavior, but the decisive check now also runs at the registration boundary under the lock, after any competing reload has had a chance to update serverCapabilities.
| seen := make(map[string]capabilityIdentityRegistration, len(registrations)) | ||
|
|
||
| for _, registration := range sortedCapabilityIdentityRegistrations(registrations) { | ||
| identifier := strings.TrimSpace(registration.identifier) |
There was a problem hiding this comment.
Low — the validator trims for comparison, but the SDK keys on the raw identifier.
identifier is the trimmed value, and it's used for the reserved / seen / existing lookups — but registration (AddPrompt/AddResource/…) uses the raw, untrimmed name as the SDK key. So the validator's notion of "equal" differs from the SDK's:
"x"and" x"are flagged as colliding even though the SDK treats them as distinct keys (false positive — rejects a valid config)." mcp-discover "is rejected as reserved although it wouldn't actually shadow the gateway'smcp-discover.
It errs toward rejecting, so it's not a security hole — but it should key on the exact value the SDK registers. Keep the empty-after-trim check as a defensive guard, but do the reserved/seen/existing comparisons on the raw registration.identifier (or trim consistently at registration time too).
There was a problem hiding this comment.
Fixed in 1cddacf8. The validator still rejects identifiers that are empty after trimming, but reserved/seen/existing lookups now use the raw registration.identifier, matching the key the SDK registers.
Added TestValidateExternalCapabilityNameCollisionsUsesRawIdentifiersForComparison to cover "summarize" vs " summarize " and " mcp-discover ".
Summary
mcp-discoverwhen dynamic tools are enabledBehavior note
Prompt policy filtering now also runs on the dynamic-reload path before prompt collision validation. Previously that path filtered tools only, so denied prompts could still remain in dynamic available capabilities.
Root cause
PR #513 protected exposed tool names, but the MCP SDK also replaces prompts, resources, and resource templates when the same prompt name, resource URI, or resource template URI template is registered. The gateway did not validate those identifiers, so later capabilities could shadow earlier registrations.
Validation
go test ./pkg/gatewaygo test ./pkg/gateway -run 'TestValidateExternalCapabilityNameCollisions|TestUpdateServerCapabilitiesRevalidatesCapabilityCollisionsUnderLock'go test ./...fails in existing integration areas in this local environment:TestIntegrationWithElicitationandTestIntegrationShortLivedContainerCloseshit MCP initialize EOFs, andTestIntegrationCallToolClickhousereports Dockerunknown flag: --gateway-arg.