fix(subagent): persist + reconcile tool-result turns in the gateway loop #2274
Open
brettdavies wants to merge 1 commit into
… loop

The gateway-native subagent loop (agent.use_gateway_loop, the path every non-Anthropic provider takes) persisted assistant tool-call turns but never the tool-result user turns. toolLoop pushed them to the in-memory array and threw the index away, and runSubagentViaGateway wired no callback to persist them.

On any mid-conversation interruption (worker recycle, rate-lease renewal, RSS watchdog, abort), replay rebuilt an unbalanced conversation: assistant tool-calls with no matching tool-results. Unlike the legacy Anthropic path, the gateway path had no reconciliation, so the next chat() dispatch threw "Tool results are missing for tool calls ..." and poisoned every retry until the job dead-lettered. It surfaced as deterministic failure of `gbrain dream --phase patterns` on the litellm/codex-proxy stack at scale.

This mirrors what the legacy Anthropic path already did:

- gateway.toolLoop gains an onToolResultTurn callback, invoked when the tool-result user turn is built, so the caller persists it before the next dispatch.
- runSubagentViaGateway persists that turn and reconciles on replay: when prior messages end with unanswered tool-calls, it re-synthesizes the tool-result turn from the settled subagent_tool_executions rows, and bails if any tool is still unsettled so a non-idempotent tool is never re-run.

The existing crash-replay suite missed this because it stubs the chat transport, so the real validator never runs and message balance went unasserted. The new regression test captures the transport input and asserts the resumed conversation is balanced. It fails without this change and passes with it.
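As a rough illustration of the callback described above, the sketch below shows one tool-loop iteration that keeps the index of the tool-result turn and hands both to an optional `onToolResultTurn` hook. The message shapes, the `appendToolResults` helper, and the synchronous `runTool` signature are assumptions made for this example, not the actual gateway types; the real loop is presumably asynchronous.

```typescript
// Hypothetical, simplified message shapes; the real gateway types may differ.
type ToolCall = { id: string; name: string; input: unknown };
type AssistantTurn = { role: "assistant"; toolCalls: ToolCall[] };
type ToolResultTurn = {
  role: "user";
  toolResults: { toolCallId: string; output: string }[];
};
type Message = AssistantTurn | ToolResultTurn;

// Receives the freshly built tool-result turn and its index in `messages`.
type OnToolResultTurn = (turn: ToolResultTurn, index: number) => void;

// One iteration of a tool loop (kept synchronous here for brevity).
function appendToolResults(
  messages: Message[],
  assistant: AssistantTurn,
  runTool: (call: ToolCall) => string,
  onToolResultTurn?: OnToolResultTurn,
): ToolResultTurn {
  const turn: ToolResultTurn = {
    role: "user",
    toolResults: assistant.toolCalls.map((call) => ({
      toolCallId: call.id,
      output: runTool(call),
    })),
  };
  // Keep the index instead of discarding it, so the caller can persist the turn.
  const index = messages.push(turn) - 1;
  onToolResultTurn?.(turn, index);
  return turn;
}

// Usage: a plain array stands in for the persisted message table.
const persisted: { index: number; turn: ToolResultTurn }[] = [];
const assistant: AssistantTurn = {
  role: "assistant",
  toolCalls: [{ id: "call_1", name: "search", input: { q: "x" } }],
};
const messages: Message[] = [assistant];
appendToolResults(
  messages,
  assistant,
  (call) => `ran ${call.name}`,
  (turn, index) => {
    persisted.push({ index, turn });
  },
);
```

Firing the hook right after the push means the persisted copy exists before any later dispatch can be interrupted.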
Closes #2273
Summary
The bug. Subagent jobs that take the gateway-loop path (`agent.use_gateway_loop = true`, used by every non-Anthropic recipe) die deterministically after any mid-conversation interruption with `[chat(litellm:gpt-X.Y)] Tool results are missing for tool calls call_<id>, ...`. The failure scaled with parallel-tool-call count, which read as a provider-bridge bug but was not. The error came from the next `chat()` dispatch operating on a stale in-memory message array, not from the wire response of the prior call.

The fix. `gateway.toolLoop` gains an `onToolResultTurn` callback fired when the tool-result user turn is built. `runSubagentViaGateway` wires the callback to persist the turn before the next dispatch AND reconciles on replay: when prior messages end with unanswered tool-calls, it re-synthesizes the tool-result turn from the settled `subagent_tool_executions` rows. Bails if any tool is still unsettled, so a non-idempotent tool is never re-run.

This mirrors the legacy Anthropic-direct path's existing reconciliation behavior; the gateway path was the only dispatch shape that lacked it.
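A minimal sketch of the replay reconciliation described above, under stated assumptions: the `ToolExecution` row shape, its `status` values, and the `reconcileOnReplay` helper are hypothetical stand-ins, not the real `subagent_tool_executions` schema or handler code. It re-synthesizes the tool-result turn only when every trailing tool call has a settled row, and reports the unsettled ids otherwise so the caller can bail instead of re-running a tool.

```typescript
// Hypothetical shapes; the real message and execution schemas may differ.
type ToolCall = { id: string; name: string };
type Message =
  | { role: "assistant"; toolCalls: ToolCall[] }
  | { role: "user"; toolResults: { toolCallId: string; output: string }[] };

// Assumed view of one tool-execution row ("failed" counts as settled here).
type ToolExecution = {
  toolCallId: string;
  status: "pending" | "complete" | "failed";
  output?: string;
};

type Reconciled =
  | { kind: "balanced" }
  | { kind: "resynthesized"; turn: Extract<Message, { role: "user" }> }
  | { kind: "unsettled"; toolCallIds: string[] };

function reconcileOnReplay(
  prior: Message[],
  executions: ToolExecution[],
): Reconciled {
  const last = prior[prior.length - 1];
  // Nothing to do unless the conversation ends with unanswered tool calls.
  if (!last || last.role !== "assistant" || last.toolCalls.length === 0) {
    return { kind: "balanced" };
  }
  const byId = new Map(executions.map((e) => [e.toolCallId, e] as const));
  const unsettled = last.toolCalls
    .filter((call) => {
      const row = byId.get(call.id);
      return !row || row.status === "pending";
    })
    .map((call) => call.id);
  // Bail rather than re-run a tool that may not be idempotent.
  if (unsettled.length > 0) return { kind: "unsettled", toolCallIds: unsettled };
  return {
    kind: "resynthesized",
    turn: {
      role: "user",
      toolResults: last.toolCalls.map((call) => ({
        toolCallId: call.id,
        output: byId.get(call.id)?.output ?? "",
      })),
    },
  };
}

// Usage: the same interrupted conversation, with and without a pending tool.
const prior: Message[] = [
  {
    role: "assistant",
    toolCalls: [
      { id: "call_a", name: "search" },
      { id: "call_b", name: "write" },
    ],
  },
];
const settled = reconcileOnReplay(prior, [
  { toolCallId: "call_a", status: "complete", output: "3 hits" },
  { toolCallId: "call_b", status: "complete", output: "ok" },
]);
const blocked = reconcileOnReplay(prior, [
  { toolCallId: "call_a", status: "complete", output: "3 hits" },
  { toolCallId: "call_b", status: "pending" },
]);
```

Returning a tagged result rather than throwing leaves the bail policy (fail the job, wait, or retry later) to the caller.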
Diagnosis
1. `gateway.toolLoop` (pre-patch) built the tool-result user turn in memory after the tool executions settled, then discarded the index. `runSubagentViaGateway` had no persistence hook, so the turn never reached `subagent_messages`.
2. After an interruption, replay rebuilt the conversation from `subagent_messages` without the tool-result row. The shape was unbalanced: assistant `tool_call` turns with no matching `tool` user turns.
3. The next `chat()` dispatch hit the provider's validator with the unbalanced shape; the provider returned the "Tool results are missing for tool calls ..." error; the worker treated it as a hard failure, retried the same unbalanced shape, and dead-lettered the job.
4. `src/core/minions/handlers/subagent.ts` already persists the user-turn reconciliation; the gateway path was simply never extended.

Reproduction
1. Set `agent.use_gateway_loop = true`.
2. Run a subagent job that makes tool calls (`gbrain dream --phase patterns` is the deterministic local trigger).
3. Mid-conversation, `kill -TERM` the worker, or let the RSS watchdog recycle the process.
4. On replay, the next `chat()` throws `Tool results are missing for tool calls call_<id>, ...`. Job dead-letters.

Observed on gbrain 0.42.43.0 through 0.42.51.0.
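To make the failing shape concrete, the sketch below flags tool calls that no later turn answers, which is the condition the `Tool results are missing` error reports. The `Message` shape and the `unansweredToolCalls` helper are hypothetical and much simpler than a real provider validator.

```typescript
// Hypothetical, flattened message shape used only for this illustration.
type Message = {
  role: "assistant" | "user";
  toolCallIds?: string[]; // ids requested by an assistant turn
  toolResultIds?: string[]; // ids answered by a tool-result user turn
};

// Returns the ids of tool calls that no later turn answers, in call order.
function unansweredToolCalls(messages: Message[]): string[] {
  const pending = new Set<string>();
  for (const message of messages) {
    if (message.role === "assistant") {
      for (const id of message.toolCallIds ?? []) pending.add(id);
    } else {
      for (const id of message.toolResultIds ?? []) pending.delete(id);
    }
  }
  return Array.from(pending);
}

// The replayed conversation is missing the tool-result turn.
const replayed: Message[] = [
  { role: "user" },
  { role: "assistant", toolCallIds: ["call_1", "call_2"] },
];
// The same conversation once the tool-result turn is present.
const resumed: Message[] = [
  ...replayed,
  { role: "user", toolResultIds: ["call_1", "call_2"] },
];

const missingBefore = unansweredToolCalls(replayed); // ["call_1", "call_2"]
const missingAfter = unansweredToolCalls(resumed); // []
```

A check of this kind, run against the captured transport input, is one way a test could assert that a resumed conversation is balanced.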
Tests
- `test/e2e/subagent-gateway-toolresult-replay.test.ts`: captures the real transport input shape and asserts the resumed conversation is balanced. Fails on master without the patch; passes with it. ~5s runtime.
- `bun run typecheck` clean.
- `bun run verify` green.

Adjacent observation (not in this PR)
`agent.use_gateway_loop` is missing from `KNOWN_CONFIG_KEYS` (operators have to discover the `--force` escape hatch from the error message). And with the gateway path's recipe coverage now broad (every non-Anthropic provider plus Anthropic itself via `native-anthropic`), the default could flip from `false` to `true`. Both worth follow-up discussion; not in scope here.