Skip to content

fix: stateful-tools domain handling + preserve tool output in conversation history#237

Merged
v1r3n merged 7 commits into
mainfrom
feat/stateful-tools-and-tool-output-fix
May 16, 2026
Merged

fix: stateful-tools domain handling + preserve tool output in conversation history#237
v1r3n merged 7 commits into
mainfrom
feat/stateful-tools-and-tool-output-fix

Conversation

@v1r3n

@v1r3n v1r3n commented May 14, 2026

Copy link
Copy Markdown
Contributor

Summary

Two critical fixes carved out of the larger coding-agent work (PR #163) as a small, independently-reviewable PR. The PAC/PAE work is being landed separately.

a) Stateful tools — domain handling

A series of fixes for stateful agents (`stateful=True` / `@tool(stateful=True)`) where workers must register under the execution's Conductor domain for isolation:

  • `fix: register handoff_check worker for SWARM parents without explicit handoffs` — any SWARM parent always needs a `handoff_check` worker; the previous gate only registered when `agent.handoffs` was non-empty, causing SWARM workflows to stall.
  • `feat(sdk): add _resolve_worker_domain + use worker_domain at 4 sites` — single source of truth for resolving the per-execution worker domain; used by `run()`, `start()`, `stream()`, and `stream_async()` so every entry point computes the same domain.
  • `fix(sdk): re-register stateful workers when polling domain changes` — when the polling thread observes a domain it hasn't registered, it re-runs the late-registration path. Prevents long-lived agents from getting stuck after a domain rebind.
  • `test(e2e): add suite 14 — stateful domain propagation` — 6 deterministic e2e tests (no LLM-output parsing) for the domain-propagation contract.
  • `test(ts): prove SWARM handoff_check registration works for all topologies` — TS-side regression net for the SWARM fix.

b) Tool call truncation — preserve tool output in conversation history

`AgentChatCompleteTaskMapper.compactToolHistory()` previously truncated any tool result older than the 6 most recent to 500 chars (`...[truncated]`). That caused the agent to lose context — a 5KB `glob_find` result kept only ~500 chars of file names, so on the next turn the agent re-issued the same `glob_find` with a different filter, then re-read the same files. Observed in workflow `637d179b-e0b5-4efd-a33f-2b2811ccbc01`.

After this fix:

  • Tool result content is never truncated mid-conversation.
  • Token-budget pressure flows through `condenseIfNeeded` which drops entire old messages — cleaner than partial truncation.
  • The only remaining transformation is collapsing write-only tool confirmations (`contextbook_write`, `contextbook_summary`) to `[ok]`, since those carry no downstream value.

Test plan

  • `./gradlew test` (server) → BUILD SUCCESSFUL, 578 tests pass
  • `pytest tests/unit/` (Python SDK) → 1572 pass, 0 fail
  • Watch CI run for full python-e2e + typescript-e2e on this branch

Scope

This PR intentionally does not include:

  • PAC/PAE (Plan-and-Compile / Plan-and-Execute) — separate PR
  • Worker liveness monitoring — separate PR if desired
  • Issue-fixer harness / examples — separate branch

v1r3n added 6 commits May 14, 2026 00:18
6 tests that verify workers register under the correct Conductor domain
when stateful=True. All assertions inspect the workflow execution via
server API — no mocks, no LLM output parsing, fully deterministic.

Tests:
1. Stateful tool completes (not stuck SCHEDULED)
2. Stateful stop_when callback executes in domain
3. Stateful swarm handoff + termination execute in domain
4. Mixed @tool and @tool(stateful=True) share domain
5. Concurrent stateful executions get different domains (isolation)
6. Non-stateful agents work without domain (regression guard)

Counterfactual verified: removing the domain fix causes test 2 to fail
(stop_when stuck SCHEDULED with pollCount=0, timeout after 300s).
… handoffs

Server always generates {parent}_handoff_check for SWARM workflows, but SDK
only registered the worker when agent.handoffs was truthy. This caused SWARM
loops (like coder↔qa) to deadlock when the parent had no explicit handoffs
(only children had OnTextMention handoffs).

Fix: also register when agent.strategy == "swarm" and agent.agents.
Includes 37 deterministic tests covering all multi-agent strategies.
…gies

TypeScript SDK already had the correct condition (handoffs.length > 0 ||
strategy === "swarm") unlike the Python bug, but had no tests proving it.

Adds 9 tests:
- SWARM parent with NO handoffs (issue fixer exact pattern)
- Bare SWARM, 3-agent SWARM, requiredWorkers filter
- Non-SWARM strategies correctly skip handoff_check
- Single agent (no children) correctly skips
- Counterfactual: old buggy condition (handoffs only) would miss SWARM
- Issue fixer exact topology verification
Required for Mode A idempotency-replay handling: when the server
returns an existing execution for an idempotency_key, workers must
register under the workflow's original taskToDomain, not the freshly
generated run_id. This was assumed by the design spec but was actually
uncommitted user work on coding_agent — pulling it forward here.

Updates the 4 start/stream call sites to compute worker_domain via
_resolve_worker_domain(execution_id, run_id) and use it for
_prepare_workers, _register_and_start_skill_workers, and the
LocalLivenessCheck registered-pair query.
A stateful agent re-run with the same task names but a different
``run_id`` (e.g., a fresh execution that isn't an idempotency replay)
needs ``WorkerManager.start()`` to register the new ``(task_name,
domain)`` pairs. The previous ``elif new_workers:`` branch only fired
when the *task-name set* changed; identical names under a different
domain were silently skipped, leaving the new domain unpolled.

Convert to ``else:`` at all four sites (sync/async ``run`` and
``start``, plus the framework and graph paths). ``WorkerManager`` is
domain-aware and only spawns missing pairs, so this remains safe under
concurrent re-runs and avoids the fork() deadlock window of a full
stop/restart cycle.

Adds test_prepare_workers_starts_worker_manager_when_only_domain_changes
to lock the contract: start() is called once even when the task names
have already been registered, as long as the domain differs.
The previous compactToolHistory implementation truncated any tool result
older than the 6 most recent to 500 chars (with '...[truncated]'). That
caused the agent to lose context — a 5KB glob_find result kept only ~500
chars of file names, so on the next turn the agent re-issued the same
glob_find with a different filter, then re-read the same files. Observed
in workflow 637d179b-e0b5-4efd-a33f-2b2811ccbc01 where iter 14's
glob_find result was clipped to '...AgentspanAIMod...[truncated]' and
the agent kept reissuing nearly-identical queries trying to see more.

Removed:
  * Mid-conversation truncation of older tool results
  * contextbook_read 'keep only latest per section' dedup (relied on
    truncation; now that we keep full results, dedup is unnecessary)
  * truncateToolResult helper and TOOL_RESULT_TRUNCATE_LENGTH /
    RECENT_TOOL_RESULTS_TO_KEEP constants — no longer used

Kept:
  * Write-only tool collapsing — contextbook_write / contextbook_summary
    produce 'wrote N chars' confirmations that add no downstream value
    and are emitted by the agent itself (can't lose information)

Token-budget pressure flows through condenseIfNeeded, which drops ENTIRE
old messages — cleaner than partial truncation since the agent either
has full context for a message or doesn't see it at all.

Verified: ./gradlew test → 578 server tests pass.

@bradyyie bradyyie left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM , tested as much as i could isolated and seems to work ! and all existing tests pass

@v1r3n v1r3n merged commit cf8e17f into main May 16, 2026
8 checks passed
@v1r3n v1r3n deleted the feat/stateful-tools-and-tool-output-fix branch May 16, 2026 16:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants