fix(subagent): reliable gateway tool-loop for non-Anthropic providers (persist tool-result messages + JSON-normalize outputs) #2257

Open
rafaelreis-r wants to merge 2 commits into garrytan:master from rafaelreis-r:pr/gateway-subagent-reliability

Conversation

@rafaelreis-r

Fixes #2256.

The gateway-native subagent loop (agent.use_gateway_loop=true, mandatory for non-Anthropic providers) couldn't reliably complete a multi-turn tool conversation. Two independent bugs combined to make subagent jobs loop on retries forever. Both surface as AI SDK v6 errors: Invalid prompt: The messages do not match the ModelMessage[] schema and Tool results are missing for tool calls <ids>. The Anthropic-direct path is unaffected.

Commit 1 — persist tool-result messages; self-heal on resume

toolLoop() fed each turn's tool results back as a role:'user' message but never persisted it (void userMessageIdx), and runSubagentViaGateway discarded ToolLoopResult.messages. So subagent_messages had gaps at every even message_idx. On any resume, loadPriorMessages rebuilt a history of adjacent assistant turns with dangling tool-calls, which AI SDK v6 rejects — before the replayState.priorTools reconciler (which only runs inside tool-dispatch, after chat()) ever gets a chance. The job then looped to max_attempts. The legacy Anthropic-direct path always persisted this message.

  • toolLoop: new onToolResultMessage callback, fired before the in-memory push (write-before-use).
  • runSubagentViaGateway: wires it to persist the tool-result message; and rebuildReplayHistory() self-heals already-corrupted jobs by reconstructing each missing tool-result message from settled subagent_tool_executions rows (matched by provider toolCallId) and re-persisting it. Conservative: a still-pending/absent exec leaves the turn un-healed rather than fabricating data. nextMessageIdx now derives from max(known idx)+1 so a healed gap can't collide under the unique (job_id, message_idx) constraint.
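The write-before-use ordering described above can be sketched as follows. This is a minimal illustration, not the actual gbrain code: the `LoopMessage` shape, the `appendToolResults` helper, and the callback signature are assumptions made for the example. Only the name `onToolResultMessage` comes from this PR.

```typescript
// Hypothetical message shapes; the real gbrain types are not shown in this PR.
type ToolResultBlock = { toolCallId: string; output: unknown };
type LoopMessage =
  | { role: "assistant"; toolCallIds: string[] }
  | { role: "user"; toolResults: ToolResultBlock[] };

interface ToolLoopOptions {
  // Assumed signature: receives the tool-result message and its message_idx.
  onToolResultMessage?: (msg: LoopMessage, messageIdx: number) => Promise<void>;
}

// One loop step: persist the tool-result message first, then push it in memory.
async function appendToolResults(
  history: LoopMessage[],
  results: ToolResultBlock[],
  messageIdx: number,
  opts: ToolLoopOptions,
): Promise<void> {
  const msg: LoopMessage = { role: "user", toolResults: results };
  // Write-before-use: if persistence throws, the message never reaches the
  // in-memory history, so memory cannot get ahead of the database.
  if (opts.onToolResultMessage) {
    await opts.onToolResultMessage(msg, messageIdx);
  }
  history.push(msg);
}
```

With this ordering, a crash between the two steps leaves a persisted message that the next resume can load, rather than an in-memory message that is lost.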

Commit 2 — JSON-normalize tool-result outputs

AI SDK v6 validates each tool-result json output against a strict JSONValue schema. node-postgres returns timestamptz as JS Date, so brain_get_page/brain_list_pages rows carry Date-typed updated_at/created_at; a raw Date (also undefined/bigint) fails the ModelMessage union the turn after such a tool runs. Replay masked it (the jsonb round-trip had already stringified the Date).

  • toModelMessages: normalize the json value via toJsonValue() (JSON round-trip: Date→ISO, undefined dropped, non-serializable→string), plus a non-throwing safeStringify() for the error-text branch. Persisted blocks are unaffected (write path already JSON-stringifies, so stored and sent forms now agree).
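A JSON round-trip normalizer along these lines might look like the sketch below. The names `toJsonValue` and `safeStringify` come from this PR, but the bodies are assumptions: in particular, stringifying `bigint` through a replacer and returning `null` for a top-level `undefined` are choices made for this example, not confirmed behavior.

```typescript
type JsonValue =
  | null
  | string
  | number
  | boolean
  | JsonValue[]
  | { [key: string]: JsonValue };

// JSON round-trip: Date -> ISO string (via Date.prototype.toJSON),
// undefined properties dropped, anything unserializable -> string.
function toJsonValue(value: unknown): JsonValue {
  try {
    const text = JSON.stringify(value, (_key, v) =>
      // Assumption for this sketch: bigint is rendered as a decimal string.
      typeof v === "bigint" ? v.toString() : v,
    );
    // JSON.stringify returns undefined for a top-level undefined or function.
    return text === undefined ? null : (JSON.parse(text) as JsonValue);
  } catch (err) {
    // For example a circular structure: fall back to a plain string.
    return String(value);
  }
}

// Non-throwing stringify for error text.
function safeStringify(value: unknown): string {
  try {
    const text = JSON.stringify(value);
    if (text !== undefined) return text;
  } catch (err) {
    // Fall through to the String() fallback below.
  }
  try {
    return String(value);
  } catch (err) {
    return "[unserializable]";
  }
}
```

A row such as `{ updated_at: new Date(...), note: undefined }` comes out as `{ updated_at: "<ISO string>" }`, which matches what the jsonb column would hand back on replay.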

Tests

  • test/e2e/subagent-gateway-resume.test.ts — forward persistence (contiguous message_idx, no gaps) + self-healing resume from a pre-fix corrupted job without re-executing the tool.
  • test/gateway-model-messages.test.ts — Date→ISO normalization, undefined drop, and a real generateText validation accepting a Date-bearing tool output (no ModelMessage rejection).
  • Existing subagent-gateway-path and cross-provider subagent-crash-replay-multi-provider suites stay green (33 pass on this branch).

Validation

Deployed against a live llama-server:qwen36b dream backlog where 23/23 multi-turn jobs were stuck looping. After the fix, jobs that previously failed every retry run the full multi-turn synthesis to completion with contiguous message_idx (verified one job: 16 messages, idx 0–15, real synthesis output; the rest drained the same way). The only non-completions were pre-existing wall-clock timeout deaths on the heaviest transcripts that had already exhausted most of their retry budget on the old bug — not a regression.

Rafael Reis and others added 2 commits June 17, 2026 08:21
…-fix histories on resume

The gateway-native subagent loop (agent.use_gateway_loop, mandatory for
non-Anthropic models) persisted assistant turns and per-tool execution rows
but never persisted the user-role message carrying the tool results back to
the model (`void userMessageIdx`). Within a single uninterrupted run this is
invisible — the tool-result messages live in the in-memory array. But on any
resume (worker restart / re-claim), loadPriorMessages rebuilt the history from
subagent_messages alone, which was missing every tool-result message, leaving
adjacent assistant turns with dangling tool-calls. AI SDK v6 then rejects the
prompt — "Tool results are missing for tool calls ..." or "messages do not
match the ModelMessage[] schema" — and the job loops to max_attempts. The
replayState.priorTools reconciler never gets a chance: chat() throws on the
malformed history before the tool-dispatch section runs.
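
The malformed history described above can be detected with a check like the
following. This is only an illustrative sketch: the `Msg` shape and the
`findDanglingToolCalls` helper are hypothetical, and the check is a simplified
stand-in for AI SDK v6's own validation, not a copy of it.

```typescript
// Hypothetical, simplified message shape for illustration.
type Msg =
  | { role: "user" }
  | { role: "assistant"; toolCallIds: string[] }
  | { role: "tool-results"; toolCallIds: string[] };

// Returns the ids of tool calls that are not answered by the next message.
function findDanglingToolCalls(history: Msg[]): string[] {
  const dangling: string[] = [];
  history.forEach((msg, i) => {
    if (msg.role !== "assistant") return;
    const next = history[i + 1];
    const answered =
      next !== undefined && next.role === "tool-results" ? next.toolCallIds : [];
    for (const id of msg.toolCallIds) {
      if (answered.indexOf(id) === -1) dangling.push(id);
    }
  });
  return dangling;
}
```

Two adjacent assistant turns, as produced by a history rebuilt without the
tool-result messages, always leave the first turn's calls dangling.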

The legacy Anthropic-direct path always persisted this message; only the
gateway path dropped it.

Fix:
  - toolLoop: add onToolResultMessage callback, fired before the in-memory
    push (write-before-use), wired in runSubagentViaGateway to persist the
    user-role tool-result message. No more message_idx gaps going forward.
  - runSubagentViaGateway: rebuildReplayHistory() self-heals already-corrupted
    jobs by reconstructing each missing tool-result message from settled
    subagent_tool_executions rows (matched by provider toolCallId) and
    re-persisting it. Conservative: a still-pending/absent exec leaves the turn
    un-healed rather than fabricating a result. nextMessageIdx now derives from
    max(known idx)+1 instead of array length so a healed gap can't collide.
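
A rough sketch of the self-heal and index derivation, under stated assumptions:
the row shapes, the status values, and the `healMissingToolResults` helper are
hypothetical, and the tool-result message is assumed to sit at the assistant
turn's message_idx + 1. The real rebuildReplayHistory() also re-persists the
healed messages, which is omitted here.

```typescript
// Hypothetical row shapes loosely modeled on the tables named above.
type MessageRow = {
  messageIdx: number;
  role: "assistant" | "user";
  toolCallIds: string[]; // calls made (assistant) or answered (user)
};
type ToolExecRow = {
  toolCallId: string;
  status: "pending" | "complete" | "failed";
};

function healMissingToolResults(
  rows: MessageRow[],
  execs: ToolExecRow[],
): { healed: MessageRow[]; nextMessageIdx: number } {
  const settled: { [id: string]: boolean } = {};
  for (const e of execs) {
    if (e.status !== "pending") settled[e.toolCallId] = true;
  }
  const known: { [idx: number]: boolean } = {};
  for (const r of rows) known[r.messageIdx] = true;

  const healed: MessageRow[] = [];
  for (const row of rows) {
    if (row.role !== "assistant" || row.toolCallIds.length === 0) continue;
    const resultIdx = row.messageIdx + 1;
    if (known[resultIdx]) continue; // tool-result message already persisted
    // Conservative: heal only when every call has a settled execution row.
    if (!row.toolCallIds.every((id) => settled[id] === true)) continue;
    healed.push({ messageIdx: resultIdx, role: "user", toolCallIds: row.toolCallIds });
    known[resultIdx] = true;
  }

  // Derive from max(known idx) + 1, not from the number of rows.
  const idxs = Object.keys(known).map(Number);
  const nextMessageIdx = idxs.length === 0 ? 0 : Math.max.apply(null, idxs) + 1;
  return { healed, nextMessageIdx };
}
```

With rows at idx 0, 1 and 3, an array-length counter would hand out idx 3
again and hit the unique (job_id, message_idx) constraint, whereas max + 1
yields 4.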

Tests: test/e2e/subagent-gateway-resume.test.ts (forward persistence + self-
healing resume without tool re-execution). Existing gateway-path and
cross-provider crash-replay suites stay green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… them

AI SDK v6 validates each tool-result `json` output against a strict JSONValue
Zod schema. gbrain tool outputs routinely carry non-JSON values: node-postgres
returns `timestamptz` columns as JS `Date`, so brain_get_page / brain_list_pages
rows include Date-typed updated_at/created_at fields. A raw Date (also undefined
/ bigint) makes the entire tool message fail the ModelMessage union, and
generateText throws "Invalid prompt: The messages do not match the
ModelMessage[] schema" on the turn after such a tool runs.

This fired on every live multi-turn run with a non-Anthropic (gateway-path)
model. Crash-replay masked it: the value had already been round-tripped through
the jsonb column to an ISO string, so a resumed run validated fine — until the
next live tool call hit a fresh Date again. (This was the "attempt 1" failure
shadowed by the separate tool-result-message persistence bug.)

Fix: normalize the tool-result `json` value via toJsonValue() (JSON round-trip:
Date -> ISO string, undefined dropped, non-serializable -> string fallback) at
the toModelMessages boundary, plus a non-throwing safeStringify() for the
error-text branch. Persisted blocks are unaffected (persistMessage already
JSON-stringifies on write, so stored and sent forms now agree).

Tests: test/gateway-model-messages.test.ts — Date->ISO normalization, undefined
drop, and a real AI SDK generateText validation accepting a Date-bearing tool
output (no ModelMessage rejection).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

Gateway subagent loop unreliable for multi-turn tool calls on non-Anthropic providers (resume loop + Date-in-output rejection)
