fix(agent): gate handoff on provider token usage, not byte estimate #821

Open
tlongwell-block wants to merge 1 commit into main from eva/handoff-token-gate

@tlongwell-block
Collaborator

fix(agent): gate handoff on provider token usage, not byte estimate

Problem

Agents fill their context, the inference provider 400s on the next request,
and the handoff never fires — the turn dies and the user's message can be
silently dropped (the batch dead-letters after retries). Root cause: the
handoff gate measured bytes (max_history_bytes * 0.75, default ~12 MiB)
while the real limit is in tokens. A normal model's token window is
exhausted long before the byte threshold is reached, so the gate is effectively
dead code, and the resulting 400 had no recovery path.
(Investigation: RESEARCH/SPROUT_HANDOFF_400_ROOT_CAUSE.md.)
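
To put a rough number on the mismatch: a byte gate can only fire before the token window fills if the history averages more bytes per token than the ratio computed below. This is a back-of-the-envelope sketch, not code from the PR. `bytes_per_token_needed` is a hypothetical helper, and the 128_000-token window is borrowed from the new knob's default rather than from the pre-fix code.

```rust
/// Hypothetical helper: the average bytes-per-token a history would need
/// for a byte-based gate to trip before the token window is full.
fn bytes_per_token_needed(byte_threshold: u64, window_tokens: u64) -> f64 {
    byte_threshold as f64 / window_tokens as f64
}

/// The ~12 MiB byte threshold quoted above, in bytes.
const BYTE_THRESHOLD: u64 = 12 * 1024 * 1024;

/// Assumed token window (the default of the new config knob).
const WINDOW_TOKENS: u64 = 128_000;
```

With these inputs the ratio comes out near 98 bytes per token. Ordinary text and JSON tool output average only a few bytes per token, which is consistent with the gate never firing in practice.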

Fix

Gate the handoff on the provider's reported input-token usage, which every
supported provider returns on the (non-streaming) response we already parse.

  • LlmResponse gains input_tokens: Option<u64> — the cache-summed input
    total. Anthropic/Databricks sum input_tokens + cache_read + cache_creation
    (plain input_tokens excludes cached tokens); OpenAI uses prompt_tokens;
    Responses uses input_tokens. Missing usage → None.
  • Session stores last_request_input_tokens paired with
    last_request_history_bytes (captured at the same instant — the history
    actually sent to that request).
  • should_handoff is token-first: it projects
    measured_tokens + estimate(current_bytes - measured_bytes) against
    token_threshold = min(window * 3/4, window - max_output_tokens)
    (reserves output headroom). The projection accounts for history appended
    since the measurement (tool results, next prompt) — closing the stale-usage
    gap. Estimate uses 1 byte/token, a guaranteed upper bound on tokens.
  • First request / missing usage → conservative byte fallback, capped by the
    existing byte budget so it can only be more conservative than before.
  • Both fields cleared on handoff (context reset) and preserved on usage: None.
  • New config knob SPROUT_AGENT_MAX_CONTEXT_TOKENS (default 128_000),
    validated to exceed max_output_tokens.
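
The gate described in the bullets above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: `UsageSnapshot`, `estimate_tokens`, the function signatures, the `>=` comparison, and the exact shape of the byte fallback are all hypothetical. Only the threshold formula, the projection, and the 1 byte/token estimate come from the description.

```rust
/// Hypothetical snapshot of the two session fields described above,
/// captured together when a request is sent.
#[derive(Debug, Clone, Copy, Default)]
struct UsageSnapshot {
    /// Provider-reported input tokens for the last request, if any.
    last_request_input_tokens: Option<u64>,
    /// Size in bytes of the history that was sent with that request.
    last_request_history_bytes: u64,
}

/// min(window * 3/4, window - max_output_tokens), saturating at zero.
fn token_threshold(window: u64, max_output_tokens: u64) -> u64 {
    let fraction = window.saturating_mul(3) / 4;
    let headroom = window.saturating_sub(max_output_tokens);
    fraction.min(headroom)
}

/// 1 byte/token estimate: counts every byte as one token.
fn estimate_tokens(bytes: u64) -> u64 {
    bytes
}

/// Token-first gate with a byte fallback when no usage is known.
fn should_handoff(
    snap: &UsageSnapshot,
    current_history_bytes: u64,
    window: u64,
    max_output_tokens: u64,
    byte_budget: u64,
) -> bool {
    let threshold = token_threshold(window, max_output_tokens);
    match snap.last_request_input_tokens {
        // Measured usage plus an estimate for history appended since.
        Some(measured) => {
            let grown = current_history_bytes
                .saturating_sub(snap.last_request_history_bytes);
            measured.saturating_add(estimate_tokens(grown)) >= threshold
        }
        // Assumed fallback: treat bytes as tokens and cap the limit at
        // the existing byte budget, so it is never looser than before.
        None => estimate_tokens(current_history_bytes) >= threshold.min(byte_budget),
    }
}
```

With a 128_000-token window and 4_096 output tokens the fraction term wins (96_000). With 64_000 output tokens the headroom term wins (64_000).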

Non-streaming only (sprout-agent does not stream). No tokenizer dependency.

Tests

  • Parser unit tests: each provider, usage present/absent, cache-summed.
  • Threshold math unit tests: fraction vs output-headroom, saturation, upper-bound estimate.
  • Integration regressions (proven to fail on the pre-fix logic):
    • token_usage_over_budget_triggers_handoff — usage over threshold hands off
      before the next complete(), on tiny prompts (proves token gate, not bytes).
    • stale_usage_plus_history_growth_triggers_handoff — usage under threshold
      but a large tool result grows history past it; the projection fires.
  • Full cargo test -p sprout-agent green (79 unit + 19 regression), clippy
    --all-targets -D warnings clean, cargo fmt --check clean.
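
As a rough illustration of what the parser cases cover, the cache-summed extraction might look like the sketch below. The struct and function names are hypothetical, the usage blocks are assumed to be already deserialized, and returning `None` when `input_tokens` itself is absent is an assumption. The Anthropic field names follow that provider's documented usage object. The PR's actual parser may be shaped differently.

```rust
/// Hypothetical, already-deserialized Anthropic-style usage block.
struct AnthropicUsage {
    input_tokens: Option<u64>,
    cache_read_input_tokens: Option<u64>,
    cache_creation_input_tokens: Option<u64>,
}

/// Hypothetical, already-deserialized OpenAI chat-completions usage block.
struct OpenAiUsage {
    prompt_tokens: Option<u64>,
}

/// Cache-summed input total: plain input plus cache reads and cache writes.
/// Returns None when the usage block or its input_tokens field is missing.
fn anthropic_input_tokens(usage: Option<&AnthropicUsage>) -> Option<u64> {
    let u = usage?;
    let base = u.input_tokens?;
    Some(
        base.saturating_add(u.cache_read_input_tokens.unwrap_or(0))
            .saturating_add(u.cache_creation_input_tokens.unwrap_or(0)),
    )
}

/// OpenAI reports the full prompt size directly as prompt_tokens.
fn openai_input_tokens(usage: Option<&OpenAiUsage>) -> Option<u64> {
    usage?.prompt_tokens
}
```

A request served mostly from cache shows why the sum matters: 10 plain input tokens plus 5_000 cache-read and 200 cache-creation tokens occupy 5_210 tokens of the window, not 10.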

Reviewed by Max (independent re-apply + full verification): 9/10+ on
minimality, elegance, correctness.

tlongwell-block requested a review from a team as a code owner June 2, 2026 19:29

Handoff gate measured bytes (~12 MiB) while the real limit is tokens, so
the token window was exhausted long before the byte threshold and the
gate was dead code — the next request 400'd with no recovery.

Gate on the provider's reported input-token usage (cache-summed for
Anthropic/Databricks, prompt_tokens for OpenAI). Token-first with output
headroom, a measured-byte delta so history grown since the last request
can't sneak past, and a conservative 1 B/tok byte fallback for the
first/unknown call. Cleared on handoff, preserved on missing usage.
New SPROUT_AGENT_MAX_CONTEXT_TOKENS knob (default 128k). Non-streaming
only, no tokenizer dep.

79 unit + 19 regression green; clippy -D warnings + fmt clean.
Live-verified end-to-end across Anthropic/OpenAI/Databricks.

Signed-off-by: npub1qyvc0c5kl4gqv2fd97fsk46tu378sqgy35vc83rvgfwne90sel7s0ed67d <011987e296fd5006292d2f930b574be47c7801048d1983c46c425d3c95f0cffd@sprout-oss.stage.blox.sqprod.co>

tlongwell-block force-pushed the eva/handoff-token-gate branch from 814bf37 to 0283732 June 2, 2026 19:30