fix(subagent): reliable gateway tool-loop for non-Anthropic providers (persist tool-result messages + JSON-normalize outputs) by rafaelreis-r · Pull Request #2257 · garrytan/gbrain

rafaelreis-r · 2026-06-17T15:24:23Z

The gateway-native subagent loop (agent.use_gateway_loop=true, mandatory for non-Anthropic providers) couldn't reliably complete a multi-turn tool conversation. Two independent bugs combined to make subagent jobs fail on every retry until max_attempts was exhausted. Both surface as AI SDK v6 errors: "Invalid prompt: The messages do not match the ModelMessage[] schema" and "Tool results are missing for tool calls <ids>". The Anthropic-direct path is unaffected.

Commit 1 — persist tool-result messages; self-heal on resume

toolLoop() fed each turn's tool results back as a role:'user' message but never persisted it (void userMessageIdx), and runSubagentViaGateway discarded ToolLoopResult.messages. So subagent_messages had gaps at every even message_idx. On any resume, loadPriorMessages rebuilt a history of adjacent assistant turns with dangling tool-calls, which AI SDK v6 rejects — before the replayState.priorTools reconciler (which only runs inside tool-dispatch, after chat()) ever gets a chance. The job then looped to max_attempts. The legacy Anthropic-direct path always persisted this message.
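
To illustrate why the rebuilt history is rejected, here is a minimal sketch of a check for dangling tool-calls. The message shapes and the name findDanglingToolCalls are hypothetical simplifications, not gbrain's or the AI SDK's actual types, and the sketch assumes each assistant turn's tool results arrive in the very next user-role message.

```typescript
// Hypothetical, simplified message shapes (assumption: not the real gbrain types).
type Block =
  | { type: "text"; text: string }
  | { type: "tool-call"; toolCallId: string }
  | { type: "tool-result"; toolCallId: string };

interface Msg {
  role: "user" | "assistant";
  content: Block[];
}

// Returns the ids of tool calls that are not answered by a tool-result block
// in the message immediately following the assistant turn that issued them.
function findDanglingToolCalls(history: Msg[]): string[] {
  const dangling: string[] = [];
  history.forEach((msg, i) => {
    if (msg.role !== "assistant") return;

    // Collect the tool-call ids answered by the next message, if it is a user turn.
    const next = history[i + 1];
    const answered = new Set<string>();
    if (next !== undefined && next.role === "user") {
      for (const block of next.content) {
        if (block.type === "tool-result") answered.add(block.toolCallId);
      }
    }

    for (const block of msg.content) {
      if (block.type === "tool-call" && !answered.has(block.toolCallId)) {
        dangling.push(block.toolCallId);
      }
    }
  });
  return dangling;
}
```

A history with two adjacent assistant turns, the shape the missing rows produce, reports the first turn's tool-call id as dangling. A trailing assistant turn with unanswered calls is reported too; a real loader would likely treat that one as the resume point rather than as corruption.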

- toolLoop: new onToolResultMessage callback, fired before the in-memory push (write-before-use).
- runSubagentViaGateway: wires the callback to persist the tool-result message. rebuildReplayHistory() also self-heals already-corrupted jobs by reconstructing each missing tool-result message from settled subagent_tool_executions rows (matched by provider toolCallId) and re-persisting it. This is conservative: a still-pending or absent exec leaves the turn un-healed rather than fabricating data. nextMessageIdx now derives from max(known idx)+1 instead of the array length, so a healed gap can't collide under the unique (job_id, message_idx) constraint.
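
The healing rule and the index derivation can be sketched as follows. This is a simplified illustration under stated assumptions, not the PR's actual code: the row shapes, the "complete" status value, and the helper names healGaps and nextMessageIdxFor are hypothetical, and the sketch assumes a tool-result message always belongs at the assistant turn's message_idx + 1.

```typescript
// Hypothetical row shapes, loosely modelled on subagent_messages and
// subagent_tool_executions (assumption: the real columns differ).
interface StoredMessage {
  messageIdx: number;
  role: "user" | "assistant";
  toolCallIds: string[];                 // tool calls issued by an assistant turn
  toolResults: Record<string, unknown>;  // results carried by a user turn
}

interface ToolExecution {
  toolCallId: string;
  status: "pending" | "complete";
  output: unknown;
}

// Derive the next index from the highest known index, not from rows.length,
// so that a history with gaps cannot produce an index that is already taken.
function nextMessageIdxFor(rows: StoredMessage[]): number {
  return rows.reduce((max, row) => Math.max(max, row.messageIdx), -1) + 1;
}

// Build the missing tool-result messages from settled executions only.
function healGaps(rows: StoredMessage[], execs: ToolExecution[]): StoredMessage[] {
  const settled = new Map<string, unknown>();
  for (const exec of execs) {
    if (exec.status === "complete") settled.set(exec.toolCallId, exec.output);
  }
  const knownIdx = new Set<number>(rows.map((row) => row.messageIdx));

  const healed: StoredMessage[] = [];
  for (const row of rows) {
    if (row.role !== "assistant" || row.toolCallIds.length === 0) continue;
    const resultIdx = row.messageIdx + 1;
    if (knownIdx.has(resultIdx)) continue; // already persisted, nothing to heal

    // Conservative: heal only when every call of the turn has a settled result.
    if (!row.toolCallIds.every((id) => settled.has(id))) continue;

    const toolResults: Record<string, unknown> = {};
    for (const id of row.toolCallIds) toolResults[id] = settled.get(id);
    healed.push({ messageIdx: resultIdx, role: "user", toolCallIds: [], toolResults });
  }
  return healed;
}
```

With rows at idx 0, 1 and 3, the array length is 3, which is already occupied, while the max-based derivation yields 4. That is the collision the unique (job_id, message_idx) constraint would otherwise surface.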

Commit 2 — JSON-normalize tool-result outputs

AI SDK v6 validates each tool-result json output against a strict JSONValue schema. node-postgres returns timestamptz as JS Date, so brain_get_page/brain_list_pages rows carry Date-typed updated_at/created_at; a raw Date (also undefined/bigint) fails the ModelMessage union the turn after such a tool runs. Replay masked it (the jsonb round-trip had already stringified the Date).

toModelMessages: normalize the json value via toJsonValue() (JSON round-trip: Date→ISO, undefined dropped, non-serializable→string), plus a non-throwing safeStringify() for the error-text branch. Persisted blocks are unaffected (write path already JSON-stringifies, so stored and sent forms now agree).
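
A minimal sketch of the normalization described above, assuming a plain JSON round-trip. The function names mirror the ones in this PR, but the bodies are illustrative rather than the actual implementation; converting bigint to a decimal string in the replacer is a choice made for this sketch.

```typescript
type JsonValue =
  | null
  | string
  | number
  | boolean
  | JsonValue[]
  | { [key: string]: JsonValue };

// Round-trip through JSON so the result contains only JSON-safe values:
// Date becomes an ISO string (via Date.prototype.toJSON), undefined
// properties are dropped, and anything that cannot be serialized falls
// back to its string form.
function toJsonValue(value: unknown): JsonValue {
  try {
    const text: string | undefined = JSON.stringify(value, (_key, v) =>
      // Assumption for this sketch: bigint is rendered as a decimal string.
      typeof v === "bigint" ? v.toString() : v,
    );
    // JSON.stringify yields undefined for a top-level undefined or function.
    return text === undefined ? null : (JSON.parse(text) as JsonValue);
  } catch {
    // For example, a circular structure makes JSON.stringify throw.
    return String(value);
  }
}

// Never throws: used where an error payload must be rendered as text.
function safeStringify(value: unknown): string {
  if (typeof value === "string") return value;
  try {
    const text: string | undefined = JSON.stringify(value);
    return text ?? String(value);
  } catch {
    return String(value);
  }
}
```

For a row such as `{ updated_at: new Date(...), missing: undefined, n: 1 }`, toJsonValue returns an object with an ISO-string updated_at, no missing key, and n unchanged, which matches what the jsonb round-trip produces on replay.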

Tests

- test/e2e/subagent-gateway-resume.test.ts: forward persistence (contiguous message_idx, no gaps) + self-healing resume from a pre-fix corrupted job without re-executing the tool.
- test/gateway-model-messages.test.ts: Date→ISO normalization, undefined drop, and a real generateText validation accepting a Date-bearing tool output (no ModelMessage rejection).
- Existing subagent-gateway-path and cross-provider subagent-crash-replay-multi-provider suites stay green (33 pass on this branch).

Validation

Deployed against a live llama-server:qwen36b dream backlog where 23/23 multi-turn jobs were stuck looping. After the fix, jobs that previously failed every retry run the full multi-turn synthesis to completion with contiguous message_idx (verified one job: 16 messages, idx 0–15, real synthesis output; the rest drained the same way). The only non-completions were pre-existing wall-clock timeout deaths on the heaviest transcripts that had already exhausted most of their retry budget on the old bug — not a regression.

…-fix histories on resume

The gateway-native subagent loop (agent.use_gateway_loop, mandatory for non-Anthropic models) persisted assistant turns and per-tool execution rows but never persisted the user-role message carrying the tool results back to the model (`void userMessageIdx`).

Within a single uninterrupted run this is invisible: the tool-result messages live in the in-memory array. But on any resume (worker restart / re-claim), loadPriorMessages rebuilt the history from subagent_messages alone, which was missing every tool-result message, leaving adjacent assistant turns with dangling tool-calls. AI SDK v6 then rejects the prompt ("Tool results are missing for tool calls ..." or "messages do not match the ModelMessage[] schema") and the job loops to max_attempts. The replayState.priorTools reconciler never gets a chance: chat() throws on the malformed history before the tool-dispatch section runs.

The legacy Anthropic-direct path always persisted this message; only the gateway path dropped it.

Fix:
- toolLoop: add onToolResultMessage callback, fired before the in-memory push (write-before-use), wired in runSubagentViaGateway to persist the user-role tool-result message. No more message_idx gaps going forward.
- runSubagentViaGateway: rebuildReplayHistory() self-heals already-corrupted jobs by reconstructing each missing tool-result message from settled subagent_tool_executions rows (matched by provider toolCallId) and re-persisting it. Conservative: a still-pending/absent exec leaves the turn un-healed rather than fabricating a result. nextMessageIdx now derives from max(known idx)+1 instead of array length so a healed gap can't collide.

Tests: test/e2e/subagent-gateway-resume.test.ts (forward persistence + self-healing resume without tool re-execution). Existing gateway-path and cross-provider crash-replay suites stay green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

… them

AI SDK v6 validates each tool-result `json` output against a strict JSONValue Zod schema. gbrain tool outputs routinely carry non-JSON values: node-postgres returns `timestamptz` columns as JS `Date`, so brain_get_page / brain_list_pages rows include Date-typed updated_at/created_at fields. A raw Date (also undefined / bigint) makes the entire tool message fail the ModelMessage union, and generateText throws "Invalid prompt: The messages do not match the ModelMessage[] schema" on the turn after such a tool runs. This fired on every live multi-turn run with a non-Anthropic (gateway-path) model.

Crash-replay masked it: the value had already been round-tripped through the jsonb column to an ISO string, so a resumed run validated fine, until the next live tool call hit a fresh Date again. (This was the "attempt 1" failure shadowed by the separate tool-result-message persistence bug.)

Fix: normalize the tool-result `json` value via toJsonValue() (JSON round-trip: Date -> ISO string, undefined dropped, non-serializable -> string fallback) at the toModelMessages boundary, plus a non-throwing safeStringify() for the error-text branch. Persisted blocks are unaffected (persistMessage already JSON-stringifies on write, so stored and sent forms now agree).

Tests: test/gateway-model-messages.test.ts covers Date->ISO normalization, undefined drop, and a real AI SDK generateText validation accepting a Date-bearing tool output (no ModelMessage rejection).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Rafael Reis and others added 2 commits June 17, 2026 08:21

rafaelreis-r wants to merge 2 commits into garrytan:master from rafaelreis-r:pr/gateway-subagent-reliability
