
fix: make generation_stream per-thread to fix server crash on worker threads#1275

Open
nish2292 wants to merge 1 commit into ml-explore:main from nish2292:feat/fix-sliding-window-server-crash

Conversation


@nish2292 nish2292 commented May 14, 2026

Summary

Fixes #1256 - `mlx_lm.server` crashes with `RuntimeError: There is no Stream(gpu, N) in current thread` when serving models from the generation thread. Affects all models via the server, especially sliding-window models (RotatingKVCache).

Why #1090 is insufficient

PR #1090 replaced mx.new_stream() with mx.new_thread_local_stream(), fixing the Stream(gpu, 0) variant but leaving an identical bug for Stream(gpu, 1):

  • ThreadLocalStream resolution registers the CommandEncoder only on the thread that triggers it
  • mx.eval() can encounter ops tagged with stream indices whose encoder doesn't exist on the current thread
  • The server's _generate() creates a separate local stream via mx.default_stream(), mismatching the module-level stream used by generate_step() and wired_limit()

Both the issue reporter and a community commenter confirmed this and recommended re-applying #1088's threading.local() approach.
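The underlying failure mode is visible with plain `threading.local()`: state created on one thread simply does not exist on another, which is why an encoder registered on the main thread is invisible to a worker. A minimal, self-contained demonstration (no mlx required; the attribute name `encoder` is illustrative):

```python
import threading

tls = threading.local()
tls.encoder = "registered-on-main"  # visible only to the thread that set it

result = {}

def worker():
    # The worker thread gets a fresh, empty threading.local namespace,
    # so the attribute set on the main thread does not exist here.
    result["has_encoder"] = hasattr(tls, "encoder")

t = threading.Thread(target=worker)
t.start()
t.join()
print(result["has_encoder"])  # → False: the main thread's state is invisible
```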

Fix

generation_stream becomes a threading.local()-backed factory function. Each thread gets its own mx.new_stream() with a locally registered CommandEncoder:

  • generate.py: 6 call sites updated to generation_stream()
  • server.py: Uses generation_stream() instead of local mx.default_stream() override
  • BatchGenerator.close(): Guards mx.synchronize() against cross-thread __del__ calls

generation_stream was never public API. External code that imported it as a variable would need to add ().
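A minimal sketch of the factory described above. The `_Stream` class and the caching logic are illustrative stand-ins, not the PR's exact code; in `generate.py` the factory would call `mx.new_stream()` instead:

```python
import threading

class _Stream:
    """Stand-in for an mlx stream; mlx_lm would call mx.new_stream() instead."""
    def __init__(self):
        self.owner = threading.get_ident()

_tls = threading.local()

def generation_stream():
    """Return this thread's stream, creating it lazily on first access.

    Creating the stream on the calling thread ensures any per-thread
    resources (the Metal CommandEncoder in mlx) are registered on the
    thread that will actually use them.
    """
    stream = getattr(_tls, "stream", None)
    if stream is None:
        stream = _Stream()  # real code: mx.new_stream()
        _tls.stream = stream
    return stream
```

Repeated calls on the same thread return the same cached stream, while each worker thread transparently creates its own, which is what makes the module-level variable safe to replace with a function call.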

Test plan

  • test_generation_stream_per_thread - each thread gets its own Stream
  • test_batch_sliding_window_threaded - BatchGenerator + RotatingKVCache on worker thread (exact crash scenario)
  • All existing tests pass (46/46)
  • Locally verified: Gemma 3 1B, Mistral 7B, Qwen2.5-Coder, GLM-4.7-Flash; thread pools, speculative decoding, concurrent requests, asyncio.to_thread

Related

…threads (ml-explore#1256)

Replace the module-level mx.new_thread_local_stream() with a
threading.local()-backed factory function. Each thread creates its
own mx.new_stream() on first access, ensuring the Metal
CommandEncoder is registered on the thread that will use it.

Changes:
- generate.py: generation_stream is now a function (6 call sites
  updated to generation_stream())
- server.py: use generation_stream() instead of local
  mx.default_stream() override
- BatchGenerator.close(): guard mx.synchronize() against cross-thread
  __del__ calls
- tests: add test_generation_stream_per_thread and
  test_batch_sliding_window_threaded

Fixes: ml-explore#1256
@nish2292 nish2292 force-pushed the feat/fix-sliding-window-server-crash branch from 5240ad4 to 60d1b5f on May 14, 2026 at 03:27
@danilopeixoto

+1

@nish2292 nish2292 marked this pull request as ready for review May 15, 2026 15:48


Successfully merging this pull request may close these issues.

mlx_lm.server crashes with 'There is no Stream(gpu, 1) in current thread' on sliding-window models (mlx-lm 0.31.3)
