fix(subagent): persist + reconcile tool-result turns in the gateway loop #2274

Open
brettdavies wants to merge 1 commit into garrytan:master from brettdavies:fix/gateway-loop-tool-result-turns-clean

@brettdavies

Closes #2273

Summary

The bug. Subagent jobs that take the gateway-loop path (agent.use_gateway_loop = true, used by every non-Anthropic recipe) die deterministically after any mid-conversation interruption with [chat(litellm:gpt-X.Y)] Tool results are missing for tool calls call_<id>, .... The failure rate scaled with the parallel-tool-call count, which made it look like a provider-bridge bug, but it was not one. The error came from the next chat() dispatch operating on a stale in-memory message array, not from the wire response of the prior call.

The fix. gateway.toolLoop gains an onToolResultTurn callback, fired when the tool-result user turn is built. runSubagentViaGateway wires the callback to persist the turn before the next dispatch, and it reconciles on replay: when prior messages end with unanswered tool-calls, it re-synthesizes the tool-result turn from the settled subagent_tool_executions rows. It bails if any tool is still unsettled, so a non-idempotent tool is never re-run.

This mirrors the legacy Anthropic-direct path's existing reconciliation behavior; the gateway path was the only dispatch shape that lacked it.
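As a rough illustration of the callback ordering, here is a minimal sketch of a tool loop that hands the tool-result turn to the caller before the next dispatch. Only the names toolLoop and onToolResultTurn come from this PR; the message types, the chat and runTool options, and maxTurns are hypothetical stand-ins, and the real gateway signatures may differ.

```typescript
// Hypothetical message shapes; the real gateway types may differ.
type ToolCall = { id: string; name: string; args: unknown };
type ToolResult = { toolCallId: string; output: string };
type Message =
  | { kind: "text"; role: "user" | "assistant"; text: string }
  | { kind: "tool_calls"; role: "assistant"; toolCalls: ToolCall[] }
  | { kind: "tool_results"; role: "user"; toolResults: ToolResult[] };

interface ToolLoopOptions {
  chat: (messages: Message[]) => Promise<Message>; // assumed transport
  runTool: (call: ToolCall) => Promise<string>; // assumed tool executor
  // Invoked once the tool-result user turn is built, before the next chat().
  onToolResultTurn?: (turn: Message, index: number) => Promise<void>;
  maxTurns?: number;
}

async function toolLoop(
  messages: Message[],
  opts: ToolLoopOptions,
): Promise<Message[]> {
  const maxTurns = opts.maxTurns ?? 8;
  for (let i = 0; i < maxTurns; i++) {
    const reply = await opts.chat(messages);
    messages.push(reply);
    if (reply.kind !== "tool_calls") return messages;

    // Run the (possibly parallel) tool calls and wait for all to settle.
    const toolResults: ToolResult[] = await Promise.all(
      reply.toolCalls.map(async (call) => ({
        toolCallId: call.id,
        output: await opts.runTool(call),
      })),
    );

    const turn: Message = { kind: "tool_results", role: "user", toolResults };
    messages.push(turn);

    // Let the caller persist the turn before the next dispatch, so an
    // interruption after this point replays a balanced conversation.
    await opts.onToolResultTurn?.(turn, messages.length - 1);
  }
  return messages;
}
```

In this sketch the callback is awaited, so a persistence failure surfaces before the next chat() call rather than after it.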

Diagnosis

  • gateway.toolLoop (pre-patch) built the tool-result user turn in memory after the tool executions settled, then discarded the index. runSubagentViaGateway had no persistence hook, so the turn never reached subagent_messages.
  • On any interruption (worker recycle, rate-lease renewal, RSS watchdog, abort, clean shutdown), replay rebuilt the conversation from subagent_messages without the tool-result row. The resulting shape was unbalanced: assistant tool-call turns with no matching tool-result user turns.
  • The next chat() dispatch hit the provider's validator with the unbalanced shape; the provider returned the "Tool results are missing for tool calls ..." error; the worker treated it as a hard failure, retried the same unbalanced shape, and dead-lettered the job.
  • The legacy Anthropic-direct path in src/core/minions/handlers/subagent.ts already persists the user-turn reconciliation; the gateway path was simply never extended.
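The unbalanced shape described above can be detected with a single pass over the replayed messages. The following is a sketch only: the Message type and the findUnansweredToolCalls helper are hypothetical and are not taken from the gbrain source.

```typescript
// Hypothetical message shapes, for illustration only.
type Message =
  | { kind: "text"; role: "user" | "assistant"; text: string }
  | { kind: "tool_calls"; role: "assistant"; toolCalls: { id: string; name: string }[] }
  | { kind: "tool_results"; role: "user"; toolResults: { toolCallId: string; output: string }[] };

// Returns the ids of tool calls that have no matching tool result.
// An empty array means the conversation is balanced.
function findUnansweredToolCalls(messages: Message[]): string[] {
  const pending = new Set<string>();
  for (const m of messages) {
    if (m.kind === "tool_calls") {
      for (const call of m.toolCalls) pending.add(call.id);
    } else if (m.kind === "tool_results") {
      for (const result of m.toolResults) pending.delete(result.toolCallId);
    }
  }
  return Array.from(pending);
}

// A replayed conversation that lost its tool-result turn:
const replayed: Message[] = [
  { kind: "text", role: "user", text: "find patterns" },
  {
    kind: "tool_calls",
    role: "assistant",
    toolCalls: [
      { id: "call_a", name: "search" },
      { id: "call_b", name: "search" },
    ],
  },
];
const unanswered = findUnansweredToolCalls(replayed);
```

Here unanswered holds both call ids, which is the condition a provider-side validator would reject.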

Reproduction

  1. Brain configured for any non-Anthropic chat provider (litellm-proxy, codex-proxy, ollama, openrouter, openai). Set agent.use_gateway_loop = true.
  2. Submit a subagent job whose dispatch fires parallel tool calls (gbrain dream --phase patterns is the deterministic local trigger).
  3. Interrupt mid-conversation: kill -TERM the worker, or let the RSS watchdog recycle the process.
  4. The worker re-claims the job and replays. Replay rebuilds the conversation from persisted rows.
  5. Next chat() throws Tool results are missing for tool calls call_<id>, .... Job dead-letters.

Observed on gbrain 0.42.43.0 through 0.42.51.0.

Tests

  • New test/e2e/subagent-gateway-toolresult-replay.test.ts: captures the real transport input shape and asserts the resumed conversation is balanced. Fails on master without the patch; passes with it. ~5s runtime.
  • Verified end-to-end on a production-shaped brain (litellm + codex-proxy + gpt-5.5). Pre-patch: every interruption dead-lettered the active jobs; post-patch: every interruption resumed cleanly across worker recycles, RSS-watchdog kills, and clean restarts.
  • bun run typecheck clean. bun run verify green.

Adjacent observation (not in this PR)

agent.use_gateway_loop is missing from KNOWN_CONFIG_KEYS (operators have to discover the --force escape hatch from the error message). And with the gateway path's recipe coverage now broad (every non-Anthropic provider plus Anthropic itself via native-anthropic), the default could flip from false to true. Both worth follow-up discussion; not in scope here.

… loop

The gateway-native subagent loop (agent.use_gateway_loop, the path every non-Anthropic provider takes) persisted assistant tool-call turns but never the tool-result user turns. toolLoop pushed them to the in-memory array and threw the index away, and runSubagentViaGateway wired no callback to persist them. On any mid-conversation interruption (worker recycle, rate-lease renewal, RSS watchdog, abort), replay rebuilt an unbalanced conversation: assistant tool-calls with no matching tool-results. Unlike the legacy Anthropic path, the gateway path had no reconciliation, so the next chat() dispatch threw "Tool results are missing for tool calls ..." and poisoned every retry until the job dead-lettered. It surfaced as deterministic failure of `gbrain dream --phase patterns` on the litellm/codex-proxy stack at scale.

The fix mirrors what the legacy Anthropic path already does:

- gateway.toolLoop gains an onToolResultTurn callback, invoked when the tool-result user turn is built, so the caller persists it before the next dispatch.
- runSubagentViaGateway persists that turn and reconciles on replay: when prior messages end with unanswered tool-calls, it re-synthesizes the tool-result turn from the settled subagent_tool_executions rows, and bails if any tool is still unsettled so a non-idempotent tool is never re-run.
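The replay-side reconciliation could look roughly like the sketch below. It assumes each subagent_tool_executions row exposes a tool-call id, a status, and an output; those column names, the status values, and the resynthesizeToolResultTurn helper are all hypothetical.

```typescript
type ToolResult = { toolCallId: string; output: string };
type ToolResultTurn = { kind: "tool_results"; role: "user"; toolResults: ToolResult[] };

// Hypothetical row shape; the real subagent_tool_executions columns may differ.
type ExecutionRow = {
  toolCallId: string;
  status: "pending" | "complete" | "failed";
  output: string | null;
};

// Rebuilds the missing tool-result turn from settled execution rows.
// Returns null (bail) if any unanswered call has no settled row, so the
// caller never re-runs a tool that may not be idempotent.
function resynthesizeToolResultTurn(
  unansweredIds: string[],
  rows: ExecutionRow[],
): ToolResultTurn | null {
  const byId = new Map<string, ExecutionRow>();
  for (const row of rows) byId.set(row.toolCallId, row);

  const toolResults: ToolResult[] = [];
  for (const id of unansweredIds) {
    const row = byId.get(id);
    if (!row || row.status === "pending") return null; // unsettled: bail
    toolResults.push({ toolCallId: id, output: row.output ?? "" });
  }
  return { kind: "tool_results", role: "user", toolResults };
}
```

Returning null rather than throwing leaves the bail policy (fail the job, wait, or escalate) to the caller, which is one possible design among several.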

The existing crash-replay suite missed this because it stubs the chat transport, so the real validator never runs and message balance went unasserted. The new regression test captures the transport input and asserts the resumed conversation is balanced. It fails without this change and passes with it.

Development

Successfully merging this pull request may close these issues.

gateway-loop subagent path never persists tool-result user turns — every resume rebuilds an unbalanced conversation that next chat() rejects
