Skip to content

fix: stream idle timeout and goroutine leak in processStepStream#191

Open
amosbird wants to merge 4 commits intocharmbracelet:mainfrom
amosbird:fix/stream-idle-timeout-and-goroutine-leak
Open

fix: stream idle timeout and goroutine leak in processStepStream#191
amosbird wants to merge 4 commits intocharmbracelet:mainfrom
amosbird:fix/stream-idle-timeout-and-goroutine-leak

Conversation

@amosbird
Copy link
Copy Markdown

Summary

  • Stream idle timeout: Add StreamIdleTimeout option to AgentStreamCall/AgentCall. When set, if the LLM stream produces no data within the specified duration, the stream is cancelled and an error is returned. This prevents the agent from hanging indefinitely when a provider's SSE stream stalls mid-response. The timeout uses a separate context so tool execution is unaffected.

  • Goroutine leak fix: processStepStream has ~17 early return paths that could skip close(toolChan) and toolExecutionWg.Wait(), leaking the tool-coordinator goroutine. Fixed by wrapping cleanup in sync.Once + defer.

Test plan

  • TestStreamingAgentIdleTimeout — hanging stream cancelled after idle timeout, returns error
  • TestStreamingAgentIdleTimeoutResetsOnChunks — slow-but-active stream succeeds (timer resets on each chunk)
  • TestStreamingAgentCallbackErrorCleanup — callback error propagates and goroutines are cleaned up
  • All existing tests pass

@amosbird amosbird force-pushed the fix/stream-idle-timeout-and-goroutine-leak branch from 74078ad to 2502026 Compare April 3, 2026 16:04
amosbird added 4 commits April 6, 2026 22:03
- Add StreamIdleTimeout option to AgentStreamCall/AgentCall to cancel
  streams that stop sending data within a configurable duration
- Fix goroutine leak: use sync.Once + defer to ensure close(toolChan)
  and toolExecutionWg.Wait() run on all return paths in processStepStream
When the consumer breaks out of the range loop (yield returns false)
and an idle timeout fires concurrently, withIdleTimeout would call
yield again after it already returned false, violating the
range-over-function protocol and causing a runtime panic:

  runtime error: range function continued iteration after function
  for loop body returned false

Track whether the consumer has stopped with a 'stopped' flag and
skip the timeout error yield when it's already set.
When streamIdleTimeout fires, it cancels the child streamCtx. The
provider then yields a StreamPartTypeError carrying context.Canceled.
The retry logic's isAbortError checked the error value (not the parent
ctx), so it treated this as a user-initiated cancellation and aborted
without retrying.

Two fixes:
- isAbortError now checks ctx.Err() instead of errors.Is(err,
  context.Canceled). If the parent context is still alive, the
  cancellation came from a child (idle timeout), not the user.
- withIdleTimeout replaces context.Canceled errors with a descriptive
  "stream idle timeout exceeded" error when the timer has fired, so
  even if the error escapes the retry loop it won't be confused with
  a user cancel.
- Increase default MaxRetries from 2 to 5 for better resilience during
  long tool-call parameter streaming (e.g. Write tool with large content)
- Call OnRetry callback for idle timeout errors too (providerErr may be
  nil in that case), so callers can clean up partial state on retry
- Update idle timeout test to verify retry count with MaxRetries=2
@amosbird amosbird force-pushed the fix/stream-idle-timeout-and-goroutine-leak branch from 594470f to 803c63d Compare April 6, 2026 14:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant