
Remote streamable-http session goes stale → gateway returns silent "success" with empty results (writes lost), no error, no reconnect #505

Description

@gabiudrescu

I'm a product owner with enough technical depth to be dangerous, leaning on Claude to do the deeper digging - so treat the analysis below as "carefully investigated, but I haven't read your Go." I've been running the gateway in front of Atlassian's hosted MCP (some context on that setup here: https://productowner.ro/blog/almost-converted-mcp-json-to-toon/) and hit a reproducible failure that silently loses writes. Reporting it carefully because the failure mode is easy to miss.

TL;DR

After a remote MCP server's streamable-http session goes stale, the gateway keeps proxying calls but they return empty results reported as success - in tens of milliseconds, with no error logged anywhere, even at --verbose. Reads coming back blank is annoying; the real problem is on writes: an updateConfluencePage / editJiraIssue reports success but never actually applies the change - the upstream version doesn't increment, and the edit is lost with no signal. I've lost a Jira sprint write and two Confluence pushes this way before noticing.

/mcp reconnect (which respawns the gateway process) restores it for a while, then it recurs.

This is adjacent to #412 but distinct: #412 is the gateway as a server dropping client connections with a hard error. This is the gateway as a client to a remote upstream, where the failure is a silent empty success - worse, because nothing surfaces it.

Plain-language vs. technical

What I see | What's actually happening
🔴 A write "succeeds" but didn't take. | updateConfluencePage/editJiraIssue return success; upstream version unchanged. Silent data loss, no error.
🟠 Reads suddenly come back empty. | Tool calls "completed successfully" but with empty content.
🟡 It happens after the gateway's been up a little while, under load. | Works fine for ~10-15 min after a fresh start, then stops returning real data - frequently in the middle of a run of write calls.
⚪ Reconnecting "fixes" it temporarily. | /mcp reconnect mints a fresh upstream session → healthy again for a while.

Environment

  • docker mcp plugin v0.42.1 (gateway advertises serverVersion Docker AI MCP Gateway 2.0.1 on the wire)
  • Docker Desktop 29.5.2, macOS (Apple Silicon)
  • Client: Claude Code (stdio transport to the gateway)
  • Remote server: Atlassian's hosted MCP - https://mcp.atlassian.com/v1/mcp, transport=streamable-http, authenticated via the gateway's built-in OAuth (docker mcp oauth authorize)
  • Gateway launched with --verbose and --servers <list incl. the remote>

Reproduction

  1. Add a remote streamable-http MCP server that uses the gateway's OAuth (I used atlassian-remote).
  2. Run the gateway and exercise the remote's tools normally (reads + writes) over ~10-20 minutes.
  3. Observe healthy calls completing in ~0.6-2s.
  4. After a while (for me ~11-15 min from a fresh process, sooner under a burst of writes), every call to that remote starts "completing successfully" in 40-70ms with empty content. Writes report success but don't commit.
  5. /mcp reconnect → healthy again for another ~10-15 min.
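
Until the gateway fails loudly, the "write succeeded but didn't commit" case in step 4 can be caught on the client side by reading the version before and after a write. A minimal sketch, assuming hypothetical `read_version` and `write` callables that wrap the real tool calls (for example getConfluencePage and updateConfluencePage) and that `read_version` returns `None` when the read comes back empty:

```python
from typing import Callable, Optional


class WriteNotAppliedError(RuntimeError):
    """A write reported success but the upstream state did not change."""


def verified_write(
    read_version: Callable[[], Optional[int]],
    write: Callable[[], object],
) -> int:
    """Run `write`, then confirm the upstream version number increased.

    Both callables are hypothetical wrappers around the remote's tools;
    `read_version` must return None when the read comes back empty.
    """
    before = read_version()
    if before is None:
        raise WriteNotAppliedError("pre-write read was empty; session may be stale")

    write()

    after = read_version()
    if after is None:
        raise WriteNotAppliedError("post-write read was empty; session may be stale")
    if after <= before:
        raise WriteNotAppliedError(
            f"write reported success but version is still {after} (was {before})"
        )
    return after
```

This costs two extra reads per write and only works for tools whose target exposes a version or similar counter, so it is a stopgap rather than a fix.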

Evidence (from the Claude Code MCP debug log, identifiers scrubbed)

Healthy (fresh process), real upstream round-trips:

11:08:30  Calling MCP tool: <remote>__searchJiraIssuesUsingJql
11:08:31  completed successfully in 800ms
11:09:00  Calling MCP tool: <remote>__editJiraIssue
11:09:01  completed successfully in 1s

Wedged (~6 min later, same process), instant empty "successes":

11:15:38  Calling MCP tool: <remote>__searchJiraIssuesUsingJql
11:15:38  completed successfully in 67ms
11:15:46  ... 62ms
11:15:50  ... 58ms
11:17:30  Calling MCP tool: <remote>__updateConfluencePage
11:17:30  completed successfully in 61ms      <-- write "succeeded"; page version did NOT change
11:17:38  Calling MCP tool: <remote>__getConfluencePage
11:17:38  completed successfully in 48ms      <-- empty
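
Since latency is the only signal in these logs, the wedge can be spotted by scanning for completions that are too fast to be real upstream round-trips. A minimal sketch, assuming the line shapes shown above; the 100 ms threshold is an assumption picked to sit between the healthy range (~0.6-2s) and the wedged range (40-70ms), and `scan_completions` is a hypothetical helper, not part of any tool:

```python
import re
from typing import Iterable, Iterator, NamedTuple, Optional

CALL_RE = re.compile(r"Calling MCP tool: (\S+)")
# "ms" is listed before "s" so that "67ms" is not read as 67 seconds.
DONE_RE = re.compile(r"completed successfully in (\d+(?:\.\d+)?)\s*(ms|s)\b")


class Completion(NamedTuple):
    tool: Optional[str]   # most recent "Calling MCP tool" name, if any
    millis: float         # reported duration, normalized to milliseconds
    suspicious: bool      # True when faster than the threshold


def scan_completions(
    lines: Iterable[str], threshold_ms: float = 100.0
) -> Iterator[Completion]:
    """Yield one Completion per 'completed successfully in ...' line."""
    current_tool: Optional[str] = None
    for line in lines:
        call = CALL_RE.search(line)
        if call:
            current_tool = call.group(1)
            continue
        done = DONE_RE.search(line)
        if done:
            value = float(done.group(1))
            millis = value if done.group(2) == "ms" else value * 1000.0
            yield Completion(current_tool, millis, millis < threshold_ms)
```

Feeding it the debug log and filtering on `suspicious` gives the first timestamp at which the remote stopped doing real work. A fast completion is only a hint, not proof: a cached or trivially small response could also finish quickly.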

Two things I think matter:

  • --verbose emits nothing per-request. The only stderr I get is the startup banner (config read, image pulls, tool listing, OAuth loop start). During the wedge: zero stderr - no 401, no "session not found", no HTTP status. So from the logs there's no error to act on; the gateway just returns empty.
  • OAuth refresh is running, yet it still wedges. The startup banner shows Starting OAuth notification monitor and Started OAuth provider loop for <remote>. So this doesn't look like access-token expiry (refresh is active) - it looks like the upstream streamable-http session itself being recycled/invalidated, with the gateway not re-establishing it and not detecting that responses are empty.

This lines up with the same root class reported elsewhere - server-side streamable-http session invalidation with no client-side re-establish: github/gh-aw#23153, anomalyco/opencode#25137, NousResearch/hermes-agent#13383.
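
For reference, the MCP streamable-http transport lets a server end a session at any time and then answer requests carrying the old `Mcp-Session-Id` with HTTP 404, at which point the client is expected to send a fresh initialize request. The re-establish step can be sketched as below. Everything here is an assumption for illustration: the `post` transport callable, the simplified payloads, and `SessionClient` itself are hypothetical, and nothing in the logs confirms that the upstream actually answers 404 in this case.

```python
from typing import Callable, Optional, Tuple

# (HTTP status, Mcp-Session-Id header if present, decoded JSON body)
Response = Tuple[int, Optional[str], dict]
# Hypothetical transport: takes a payload and the session id to send (or None).
Transport = Callable[[dict, Optional[str]], Response]


class UpstreamSessionError(RuntimeError):
    """The upstream session could not be used or re-established."""


class SessionClient:
    def __init__(self, post: Transport) -> None:
        self._post = post
        self._session_id: Optional[str] = None

    def _initialize(self) -> None:
        # Simplified: a real initialize carries protocol version and capabilities.
        status, session_id, _ = self._post({"method": "initialize"}, None)
        if status != 200 or session_id is None:
            raise UpstreamSessionError(f"initialize failed with HTTP {status}")
        self._session_id = session_id

    def call(self, payload: dict) -> dict:
        if self._session_id is None:
            self._initialize()
        status, _, body = self._post(payload, self._session_id)
        if status == 404:
            # Session was recycled upstream: start a new one and retry once.
            self._initialize()
            status, _, body = self._post(payload, self._session_id)
        if status != 200:
            raise UpstreamSessionError(f"upstream returned HTTP {status}")
        return body
```

The single retry is safe for writes only on the assumption that a 404 means the request was rejected before being applied. Any other non-200 status is raised rather than turned into an empty success.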

What seems to be the right fix (and why I think it stalled)

The keepalive direction looks correct. A commit referenced from #412 - 71b0a90 "fix(sessions): pong response resets inactivity timer (MCP-spec liveness…)" - is exactly the shape I'd expect (don't let an idle session lapse). But as far as I can tell it never landed: the commit 404s (the fork it lived on is no longer reachable), there's no open PR carrying it, and it's in no release (latest is v0.42.2, whose only session-related change is "reuse containers per session" - local containers, not the remote upstream).
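
To make the idea in that commit title concrete: a pong counts as activity, so a session that is idle but alive never reaches the inactivity cutoff. The sketch below illustrates only that idea; it is not the content of 71b0a90 (which can't be read), and the class name, timeout value, and injectable clock are all assumptions.

```python
import time
from typing import Callable


class InactivityTimer:
    """Tracks time since the last sign of life on a session."""

    def __init__(
        self,
        timeout_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._timeout_s = timeout_s
        self._clock = clock
        self._last_activity = clock()

    def on_activity(self) -> None:
        """Call for any request or response on the session."""
        self._last_activity = self._clock()

    def on_pong(self) -> None:
        """A pong proves the peer is alive, so it resets the timer too."""
        self.on_activity()

    def expired(self) -> bool:
        return self._clock() - self._last_activity >= self._timeout_s
```

A keepalive loop would send a ping at some interval shorter than the timeout and call `on_pong()` when the reply arrives; if `expired()` ever becomes true, the session should be treated as dead and re-established rather than reused.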

Two complementary asks, in priority order:

  1. Don't report empty upstream responses as success. When a streamable-http call to a remote returns empty/no-content where data is expected (or the session is gone), surface an error to the client instead of a silent empty success. Silent success on writes is the data-loss vector; even without a full reconnect fix, failing loudly would stop the bleeding.
  2. Re-establish a recycled upstream session (the keepalive/71b0a90 direction, plus reconnect-on-stale-session), so it doesn't wedge in the first place.
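
Ask 1 could be as small as a check on the tool result before it is handed back to the client. A minimal sketch, assuming the usual MCP tool-result shape (a `content` list plus optional `isError` and `structuredContent`); the function name and the policy of treating every empty result as an error are assumptions, and a real gateway would likely need an allowlist for tools that legitimately return nothing.

```python
class EmptyUpstreamResultError(RuntimeError):
    """A tool call came back 'successful' but carried no payload."""


def ensure_non_empty(tool_name: str, result: dict) -> dict:
    """Return `result` unchanged, or raise if it is an empty success."""
    if result.get("isError"):
        return result  # already an explicit error; pass it through untouched

    content = result.get("content") or []
    # A non-text item (image, resource) or a text item with non-blank text
    # counts as a real payload.
    has_payload = any(
        item.get("type") != "text" or (item.get("text") or "").strip()
        for item in content
    )
    if not has_payload and not result.get("structuredContent"):
        raise EmptyUpstreamResultError(
            f"{tool_name}: upstream returned success with empty content"
        )
    return result
```

With a check like this in the proxy path, the wedged updateConfluencePage call from the log above would surface as a failed tool call instead of a 61ms success.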

Confidence

High confidence on the symptom and timing, medium on the exact mechanism - the session-recycling diagnosis is inferred from behavior and logs, not from source. Happy to provide a fuller --verbose log, test a patch against my setup (I can reproduce on demand), or help in whatever way is useful.
