fix(subagent): persist + reconcile tool-result turns in the gateway loop by brettdavies · Pull Request #2274 · garrytan/gbrain

brettdavies · 2026-06-18T16:21:22Z

Summary

The bug. Subagent jobs that take the gateway-loop path (agent.use_gateway_loop = true, used by every non-Anthropic recipe) die deterministically after any mid-conversation interruption, failing with [chat(litellm:gpt-X.Y)] Tool results are missing for tool calls call_<id>, .... The failure rate scaled with the number of parallel tool calls, which made it look like a provider-bridge bug, but it was not. The error came from the next chat() dispatch operating on a stale in-memory message array, not from the wire response of the prior call.

The fix. gateway.toolLoop gains an onToolResultTurn callback that fires when the tool-result user turn is built. runSubagentViaGateway wires the callback to persist the turn before the next dispatch, and it also reconciles on replay: when the prior messages end with unanswered tool calls, it re-synthesizes the tool-result turn from the settled subagent_tool_executions rows. It bails if any tool execution is still unsettled, so a non-idempotent tool is never re-run.
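As a rough illustration of the callback half of the fix, here is a minimal sketch of a tool loop that hands the tool-result turn to the caller before dispatching again. Only the names toolLoop and onToolResultTurn come from this PR; the message shapes, the chat and runTool parameters, and the maxTurns option are assumptions made for the example, not the actual gbrain signatures.

```typescript
// Hypothetical message shapes; the real gbrain types are not shown in this PR.
interface ToolCall { id: string; name: string; input: unknown }
interface ToolResult { toolCallId: string; output: string }
interface Message {
  role: "user" | "assistant";
  text?: string;
  toolCalls?: ToolCall[];
  toolResults?: ToolResult[];
}

interface ToolLoopOptions {
  // Assumed transport: takes the full conversation, returns the next assistant turn.
  chat: (messages: Message[]) => Promise<{ text?: string; toolCalls: ToolCall[] }>;
  // Assumed tool executor.
  runTool: (call: ToolCall) => Promise<string>;
  // Fired once the tool-result user turn is built, before the next chat() dispatch.
  onToolResultTurn?: (turn: Message, index: number) => Promise<void>;
  maxTurns?: number;
}

async function toolLoop(messages: Message[], opts: ToolLoopOptions): Promise<Message[]> {
  const maxTurns = opts.maxTurns ?? 8;
  for (let turn = 0; turn < maxTurns; turn++) {
    const reply = await opts.chat(messages);
    messages.push({ role: "assistant", text: reply.text, toolCalls: reply.toolCalls });
    if (reply.toolCalls.length === 0) return messages; // no tools requested: done

    // Parallel tool calls settle together and become ONE user turn.
    const toolResults = await Promise.all(
      reply.toolCalls.map(async (call) => ({
        toolCallId: call.id,
        output: await opts.runTool(call),
      })),
    );
    const resultTurn: Message = { role: "user", toolResults };
    const index = messages.push(resultTurn) - 1;

    // Awaiting here is the point of the sketch: the caller gets to persist
    // the turn before the loop can dispatch the next chat().
    await opts.onToolResultTurn?.(resultTurn, index);
  }
  return messages; // turn budget exhausted
}
```

In this sketch the callback is awaited rather than fired and forgotten, so an interruption after the callback returns can no longer leave persisted assistant tool calls without their results.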

This mirrors the legacy Anthropic-direct path's existing reconciliation behavior; the gateway path was the only dispatch shape that lacked it.

Diagnosis

1. gateway.toolLoop (pre-patch) built the tool-result user turn in memory after the tool executions settled, then discarded the index. runSubagentViaGateway had no persistence hook, so the turn never reached subagent_messages.
2. On any interruption (worker recycle, rate-lease renewal, RSS watchdog, abort, clean shutdown), replay rebuilt the conversation from subagent_messages without the tool-result row. The resulting shape was unbalanced: assistant tool_call turns with no matching tool-result user turns.
3. The next chat() dispatch hit the provider's validator with the unbalanced shape. The provider returned the "Tool results are missing for tool calls ..." error, the worker treated it as a hard failure and retried the same unbalanced shape, and the job was dead-lettered.
4. The legacy Anthropic-direct path in src/core/minions/handlers/subagent.ts already persists the user-turn reconciliation; the gateway path was simply never extended.
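To make "unbalanced" concrete, the sketch below scans a replayed conversation and returns the tool-call ids that never received a result. The Message shape and the function name unansweredToolCallIds are hypothetical, chosen for illustration; they are not taken from the gbrain source.

```typescript
// Hypothetical shapes for illustration only.
interface ToolCall { id: string; name: string }
interface Message {
  role: "user" | "assistant";
  text?: string;
  toolCalls?: ToolCall[];
  toolResults?: { toolCallId: string }[];
}

// Returns the ids of tool calls with no matching tool result, in call order.
function unansweredToolCallIds(messages: Message[]): string[] {
  const pending = new Set<string>();
  for (const message of messages) {
    for (const call of message.toolCalls ?? []) pending.add(call.id);
    for (const result of message.toolResults ?? []) pending.delete(result.toolCallId);
  }
  return Array.from(pending);
}

// What replay would rebuild pre-patch: the assistant turn was persisted,
// the tool-result user turn was not.
const replayed: Message[] = [
  { role: "user", text: "find patterns" },
  {
    role: "assistant",
    toolCalls: [
      { id: "call_a", name: "search" },
      { id: "call_b", name: "search" },
    ],
  },
];

console.log(unansweredToolCallIds(replayed)); // [ 'call_a', 'call_b' ]
```

A conversation is balanced when this list is empty. Because a replayed turn with parallel tool calls leaves every one of its calls unanswered, the provider's error lists more ids as the parallel-call count grows.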

Reproduction

1. Configure a brain for any non-Anthropic chat provider (litellm-proxy, codex-proxy, ollama, openrouter, openai). Set agent.use_gateway_loop = true.
2. Submit a subagent job whose dispatch fires parallel tool calls (gbrain dream --phase patterns is the deterministic local trigger).
3. Interrupt mid-conversation: kill -TERM the worker, or let the RSS watchdog recycle the process.
4. The worker re-claims the job and replays. Replay rebuilds the conversation from persisted rows.
5. The next chat() throws Tool results are missing for tool calls call_<id>, .... The job dead-letters.

Observed on gbrain 0.42.43.0 through 0.42.51.0.

Tests

New test/e2e/subagent-gateway-toolresult-replay.test.ts: captures the real transport input shape and asserts the resumed conversation is balanced. Fails on master without the patch; passes with it. ~5s runtime.
Verified end-to-end on a production-shaped brain (litellm + codex-proxy + gpt-5.5). Pre-patch: every interruption dead-lettered the active jobs; post-patch: every interruption resumed cleanly across worker recycles, RSS-watchdog kills, and clean restarts.
bun run typecheck clean. bun run verify green.

Adjacent observation (not in this PR)

agent.use_gateway_loop is missing from KNOWN_CONFIG_KEYS (operators have to discover the --force escape hatch from the error message). And with the gateway path's recipe coverage now broad (every non-Anthropic provider plus Anthropic itself via native-anthropic), the default could flip from false to true. Both worth follow-up discussion; not in scope here.

… loop

The gateway-native subagent loop (agent.use_gateway_loop, the path every non-Anthropic provider takes) persisted assistant tool-call turns but never the tool-result user turns. toolLoop pushed them to the in-memory array and threw the index away, and runSubagentViaGateway wired no callback to persist them.

On any mid-conversation interruption (worker recycle, rate-lease renewal, RSS watchdog, abort), replay rebuilt an unbalanced conversation: assistant tool-calls with no matching tool-results. Unlike the legacy Anthropic path, the gateway path had no reconciliation, so the next chat() dispatch threw "Tool results are missing for tool calls ..." and poisoned every retry until the job dead-lettered. It surfaced as deterministic failure of `gbrain dream --phase patterns` on the litellm/codex-proxy stack at scale.

This mirrors what the legacy Anthropic path already did:

- gateway.toolLoop gains an onToolResultTurn callback, invoked when the tool-result user turn is built, so the caller persists it before the next dispatch.
- runSubagentViaGateway persists that turn and reconciles on replay: when prior messages end with unanswered tool-calls, it re-synthesizes the tool-result turn from the settled subagent_tool_executions rows, and bails if any tool is still unsettled so a non-idempotent tool is never re-run.

The existing crash-replay suite missed this because it stubs the chat transport, so the real validator never runs and message balance went unasserted. The new regression test captures the transport input and asserts the resumed conversation is balanced. It fails without this change and passes with it.
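The replay-time reconciliation can be sketched as a pure function over the unanswered tool-call ids and the persisted execution rows. Everything here is an assumption for illustration: the row fields, the status values ("pending", "complete", "failed"), the choice to treat a failed execution as settled, and the function name. The real subagent_tool_executions schema is not shown in this PR.

```typescript
// Hypothetical row and result shapes; the real schema may differ.
type ExecutionStatus = "pending" | "complete" | "failed";
interface ToolExecutionRow {
  toolCallId: string;
  status: ExecutionStatus;
  output?: string;
  error?: string;
}
interface ToolResult { toolCallId: string; output: string; isError: boolean }

type Reconciliation =
  | { kind: "balanced" }
  | { kind: "synthesized"; toolResults: ToolResult[] }
  | { kind: "unsettled"; toolCallIds: string[] };

function reconcileTrailingToolCalls(
  pendingCallIds: string[],
  rows: ToolExecutionRow[],
): Reconciliation {
  if (pendingCallIds.length === 0) return { kind: "balanced" };

  const rowsById = new Map<string, ToolExecutionRow>();
  for (const row of rows) rowsById.set(row.toolCallId, row);

  const unsettled: string[] = [];
  const toolResults: ToolResult[] = [];
  for (const id of pendingCallIds) {
    const row = rowsById.get(id);
    // A missing or still-pending row means the tool may or may not have run.
    if (!row || row.status === "pending") {
      unsettled.push(id);
      continue;
    }
    const failed = row.status === "failed";
    toolResults.push({
      toolCallId: id,
      output: failed ? (row.error ?? "tool failed") : (row.output ?? ""),
      isError: failed,
    });
  }

  // Bail rather than re-run: re-executing could repeat a non-idempotent side effect.
  if (unsettled.length > 0) return { kind: "unsettled", toolCallIds: unsettled };
  return { kind: "synthesized", toolResults };
}
```

Under these assumptions, the caller would append one user turn built from toolResults when the outcome is "synthesized" and refuse to dispatch when it is "unsettled". Either way it never sends the unbalanced conversation to the provider.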

brettdavies mentioned this pull request Jun 18, 2026

gateway-loop subagent path never persists tool-result user turns — every resume rebuilds an unbalanced conversation that next chat() rejects #2273

Open

brettdavies wants to merge 1 commit into garrytan:master from brettdavies:fix/gateway-loop-tool-result-turns-clean
