fix(agent): gate handoff on provider token usage, not byte estimate #821
Open
tlongwell-block wants to merge 1 commit into
Handoff gate measured bytes (~12 MiB) while the real limit is tokens, so the token window was exhausted long before the byte threshold and the gate was dead code — the next request 400'd with no recovery. Gate on the provider's reported input-token usage (cache-summed for Anthropic/Databricks, prompt_tokens for OpenAI). Token-first with output headroom, a measured-byte delta so history grown since the last request can't sneak past, and a conservative 1 B/tok byte fallback for the first/unknown call. Cleared on handoff, preserved on missing usage. New SPROUT_AGENT_MAX_CONTEXT_TOKENS knob (default 128k). Non-streaming only, no tokenizer dep. 79 unit + 19 regression green; clippy -D warnings + fmt clean. Live-verified end-to-end across Anthropic/OpenAI/Databricks. Signed-off-by: npub1qyvc0c5kl4gqv2fd97fsk46tu378sqgy35vc83rvgfwne90sel7s0ed67d <011987e296fd5006292d2f930b574be47c7801048d1983c46c425d3c95f0cffd@sprout-oss.stage.blox.sqprod.co>
814bf37 to
0283732
fix(agent): gate handoff on provider token usage, not byte estimate
Problem
Agents fill their context, the inference provider 400s on the next request,
and the handoff never fires — the turn dies and the user's message can be
silently dropped (the batch dead-letters after retries). Root cause: the
handoff gate measured bytes (max_history_bytes * 0.75, default ~12 MiB)
while the real limit is in tokens. A normal model's token window is
exhausted long before the byte threshold is reached, so the gate is effectively
dead code, and the resulting 400 had no recovery path.
(Investigation: RESEARCH/SPROUT_HANDOFF_400_ROOT_CAUSE.md.)
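A rough back-of-the-envelope sketch shows why the byte gate could not fire first. The 4 bytes/token ratio is an assumption (a common rule of thumb for English text, not a figure from this PR), and the helper names `tokens_at` and `bytes_to_fill` are hypothetical, not part of sprout-agent.

```rust
/// Bytes per MiB.
pub const MIB: u64 = 1024 * 1024;

/// Number of tokens a byte count represents at a given bytes-per-token ratio
/// (integer division, rounded down).
pub fn tokens_at(bytes: u64, bytes_per_token: u64) -> u64 {
    bytes / bytes_per_token
}

/// Number of history bytes needed to fill a token window at a given ratio.
pub fn bytes_to_fill(window_tokens: u64, bytes_per_token: u64) -> u64 {
    window_tokens * bytes_per_token
}

// From the description: the byte gate fires at roughly 12 MiB of history,
// and the default window in this PR is 128_000 tokens.
//
// ASSUMPTION: ~4 bytes per token.
//
//   tokens_at(12 * MIB, 4)    == 3_145_728 tokens at the byte threshold
//   bytes_to_fill(128_000, 4) == 512_000 bytes (about 0.49 MiB) to fill the window
//
// Under that assumption the window is full at about 0.5 MiB of history,
// more than 24x earlier than the ~12 MiB byte gate.
```

Even at the pessimistic 1 byte/token used by the fix, 12 MiB would still correspond to far more tokens than the window holds, so the conclusion does not depend on the exact ratio.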
Fix
Gate the handoff on the provider's reported input-token usage, which every
supported provider returns on the (non-streaming) response we already parse.
- LlmResponse gains input_tokens: Option<u64>, the cache-summed input total. Anthropic/Databricks sum input_tokens + cache_read + cache_creation (plain input_tokens excludes cached tokens); OpenAI uses prompt_tokens; Responses uses input_tokens. Missing usage → None.
- last_request_input_tokens is paired with last_request_history_bytes (captured at the same instant: the history actually sent to that request).
- should_handoff is token-first: it projects measured_tokens + estimate(current_bytes - measured_bytes) against token_threshold = min(window * 3/4, window - max_output_tokens) (reserves output headroom). The projection accounts for history appended since the measurement (tool results, next prompt), closing the stale-usage gap. Estimate uses 1 byte/token, a guaranteed upper bound on tokens.
- Byte fallback (first/unknown call): conservative 1 B/tok estimate […] existing byte budget so it can only be more conservative than before.
- The measurement is cleared on handoff and preserved on usage: None.
- New knob SPROUT_AGENT_MAX_CONTEXT_TOKENS (default 128_000), validated to exceed max_output_tokens.
- Non-streaming only (sprout-agent does not stream). No tokenizer dependency.
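The usage normalization and the token-first gate described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in this PR: the `Usage` enum, the `Measurement` struct, the function signatures, and the `>=` comparison are hypothetical, and the interaction with the existing byte budget is not modeled.

```rust
/// Simplified provider usage shapes (hypothetical types for illustration).
#[derive(Debug, Clone, Copy)]
pub enum Usage {
    /// Anthropic/Databricks style: plain input_tokens excludes cached tokens.
    Cached { input_tokens: u64, cache_read: u64, cache_creation: u64 },
    /// OpenAI chat style.
    Prompt { prompt_tokens: u64 },
    /// Responses style.
    Input { input_tokens: u64 },
}

/// Cache-summed input total; missing usage maps to None.
pub fn total_input_tokens(usage: Option<Usage>) -> Option<u64> {
    usage.map(|u| match u {
        Usage::Cached { input_tokens, cache_read, cache_creation } => {
            input_tokens + cache_read + cache_creation
        }
        Usage::Prompt { prompt_tokens } => prompt_tokens,
        Usage::Input { input_tokens } => input_tokens,
    })
}

/// Token usage and history size captured for the same request.
#[derive(Debug, Clone, Copy)]
pub struct Measurement {
    pub input_tokens: u64,
    pub history_bytes: u64,
}

/// min(window * 3/4, window - max_output_tokens): reserves output headroom.
pub fn token_threshold(window: u64, max_output_tokens: u64) -> u64 {
    (window * 3 / 4).min(window.saturating_sub(max_output_tokens))
}

/// Token-first gate. History appended since the measurement is estimated at
/// 1 byte/token. With no measurement (first/unknown call) the whole history
/// is estimated at 1 byte/token.
/// ASSUMPTION: the gate fires when the projection reaches the threshold (>=).
pub fn should_handoff(
    last: Option<Measurement>,
    current_bytes: u64,
    window: u64,
    max_output_tokens: u64,
) -> bool {
    let projected = match last {
        Some(m) => {
            let grown = current_bytes.saturating_sub(m.history_bytes);
            m.input_tokens.saturating_add(grown)
        }
        None => current_bytes,
    };
    projected >= token_threshold(window, max_output_tokens)
}
```

With a 128_000-token window and a hypothetical max_output_tokens of 8_192, the threshold is min(96_000, 119_808) = 96_000. A measurement of 50_000 tokens taken at 10_000 history bytes does not trigger a handoff at 12_000 bytes (projection 52_000), but does at 70_000 bytes (projection 110_000), which is the stale-usage case the projection is meant to close.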
Tests
- token_usage_over_budget_triggers_handoff: usage over threshold hands off before the next complete(), on tiny prompts (proves token gate, not bytes).
- stale_usage_plus_history_growth_triggers_handoff: usage under threshold but a large tool result grows history past it; the projection fires.

cargo test -p sprout-agent green (79 unit + 19 regression), clippy --all-targets -D warnings clean, cargo fmt --check clean.

Reviewed by Max (independent re-apply + full verification): 9/10+ on
minimalness, elegance, correctness.