Problem
When a conversation approaches or exceeds the model's context window, users see raw HTTP error messages with request IDs, URLs, and JSON payloads instead of actionable guidance. There is no automatic recovery: the session is stuck until the user discovers the `/compact` workaround by trial and error.
Observed errors
Two distinct failure patterns have been observed from the same session:
1. Anthropic 500 Internal Server Error (opaque context overflow)
```
all models failed: error receiving from stream: POST "https://api.anthropic.com/v1/messages?beta=true": 500 Internal Server Error (Request-ID: req_011CYBcGTSLUALfwB2eaiNPT)
{"type":"error","error":{"type":"api_error","message":"Internal server error"},"request_id":"req_011CYBcGTSLUALfwB2eaiNPT"}
```
Anthropic sometimes returns a 500 instead of a clear 400 when the payload is too large. Since 500 is classified as retryable by `isRetryableModelError()`, we retry the same request 3 times with exponential backoff, and every attempt fails identically because the context hasn't changed.
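A fail-fast heuristic for this case could look like the following sketch. The names here (`shouldRetry500`, `estimatedTokens`, `contextLimit`) are illustrative assumptions, not the actual `isRetryableModelError()` signature: a 500 stays retryable unless the request was already near the context window, in which case retrying the identical payload cannot succeed.

```go
package main

import "fmt"

// shouldRetry500 is a hypothetical helper: treat a 500 as retryable only
// when the request was comfortably under the context window. If we were
// already near the limit, assume context overflow and fail fast rather
// than burning retries on an unchanged oversized payload.
func shouldRetry500(estimatedTokens, contextLimit int64) bool {
	return float64(estimatedTokens) < float64(contextLimit)*0.9
}

func main() {
	fmt.Println(shouldRetry500(120000, 200000)) // plausibly transient: retry
	fmt.Println(shouldRetry500(226360, 200000)) // likely overflow: fail fast
}
```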
2. Anthropic 400 thinking budget / prompt too long cascade
```
all models failed: error receiving from stream: POST "https://api.anthropic.com/v1/messages?beta=true": 400 Bad Request
`max_tokens` must be greater than `thinking.budget_tokens`
```
```
all models failed: error receiving from stream: POST "https://api.anthropic.com/v1/messages": 400 Bad Request
prompt is too long: 226360 tokens > 200000 maximum
```
This is a two-step cascade:
- First, the Beta API path (with thinking enabled) fails because the prompt is so large that `max_tokens` can't accommodate the thinking budget. Anthropic validates `max_tokens > thinking.budget_tokens` before checking prompt length, so this is the error we get.
- Then, the non-Beta fallback path (without thinking) sends the same oversized prompt and gets the real error: `prompt is too long: 226360 tokens > 200000 maximum`.

The adapter's `retryFn` tries to fix this by clamping `max_tokens`, but that doesn't help when the input itself exceeds the context limit.
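Why the clamp can't help is easy to see in a sketch. All names and values below are illustrative assumptions (this is not the adapter's actual `retryFn`): the largest valid `max_tokens` is bounded by the context window minus the prompt, which is already negative once the prompt alone exceeds the limit.

```go
package main

import "fmt"

// clampMaxTokens is a hypothetical version of the clamp-style retry fix:
// it tries to pick a max_tokens that satisfies both constraints
// (max_tokens > thinking budget, prompt + max_tokens <= context limit).
// It returns 0 when no valid value exists.
func clampMaxTokens(maxTokens, thinkingBudget, contextLimit, promptTokens int64) int64 {
	remaining := contextLimit - promptTokens
	if remaining <= thinkingBudget {
		// The prompt already consumes (or exceeds) the window: no
		// max_tokens can be both > thinkingBudget and <= remaining.
		return 0
	}
	if maxTokens <= thinkingBudget {
		maxTokens = thinkingBudget + 1 // satisfy the thinking constraint
	}
	if maxTokens > remaining {
		maxTokens = remaining // stay within the context window
	}
	return maxTokens
}

func main() {
	// The observed failure: 226360-token prompt against a 200k window.
	fmt.Println(clampMaxTokens(32000, 16000, 200000, 226360)) // 0: unfixable
	// A normal case where clamping does work.
	fmt.Println(clampMaxTokens(8000, 16000, 200000, 100000)) // 16001
}
```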
How to reproduce
- Start a long session with many tool calls producing large outputs (file reads, search results, etc.)
- Continue using the session past ~90% of the model's context window (200k tokens for Claude)
- The next request may trigger either error pattern above
- The user sees the raw error and the session appears broken
- If the user sends any follow-up message (e.g., "continue"), the compaction check fires on the next loop iteration: the stale token counts from the previous successful response now cross the 90% threshold, compaction runs, and the conversation resumes
Root cause
The compaction check in `runtime.go` uses stale token counts from the previous response:
```go
if sess.InputTokens+sess.OutputTokens > int64(float64(contextLimit)*0.9) {
	r.Summarize(ctx, sess, "", events)
}
```
These counts (`sess.InputTokens`, `sess.OutputTokens`) are updated only after a successful response. If a single turn adds enough tokens (e.g., multiple large tool outputs) to push usage from ~88% to over 100%, the threshold never trips: the oversized request is sent to the API and fails.
Specific issues
- **Raw error messages shown to the user.** The full HTTP error with request IDs, URLs, and JSON payloads is displayed as-is in the TUI (`events <- Error(err.Error())` in `runtime.go`). Not user-friendly and not actionable.
- **No auto-recovery on context overflow.** When the error is clearly caused by context overflow (either the 500 heuristic or the explicit `prompt is too long` 400), we could auto-compact and retry instead of stopping the session.
- **Wasteful retries on 500 context overflow.** A 500 caused by context overflow will fail on every retry, since the context doesn't change between attempts. We burn time on 3 attempts plus exponential backoff for nothing.
- **Stale token count in compaction check.** The 90% threshold relies on token counts that don't include the current turn's tool outputs, so the check can miss overflow when a single turn contributes a large number of tokens.
- **`max_tokens`/`thinking.budget_tokens` error is a red herring.** When the prompt is too large, the Beta API path fails with a confusing thinking-budget constraint error before the real "prompt too long" error appears on the non-Beta fallback. This masks the actual problem.
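For the auto-recovery case, the explicit 400 is at least machine-detectable. The sketch below matches the observed error text from above to classify it as context overflow; the function name is a hypothetical helper, not an existing one in the codebase.

```go
package main

import (
	"fmt"
	"regexp"
)

// promptTooLongRe matches the message format observed in the 400 response:
// "prompt is too long: 226360 tokens > 200000 maximum".
var promptTooLongRe = regexp.MustCompile(`prompt is too long: (\d+) tokens > (\d+) maximum`)

// isContextOverflow is a hypothetical classifier the runtime could use to
// decide to auto-compact and retry instead of surfacing the raw error.
func isContextOverflow(errMsg string) bool {
	return promptTooLongRe.MatchString(errMsg)
}

func main() {
	msg := `POST "https://api.anthropic.com/v1/messages": 400 Bad Request
prompt is too long: 226360 tokens > 200000 maximum`
	fmt.Println(isContextOverflow(msg))
	fmt.Println(isContextOverflow("Internal server error"))
}
```

The 500 heuristic from pattern 1 would still be needed alongside this, since the opaque `api_error` body carries no usable signal on its own.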