
fix: make generation_stream per-thread to fix server crash on worker threads#1275

Open
nish2292 wants to merge 1 commit into ml-explore:main from nish2292:feat/fix-sliding-window-server-crash

Conversation


@nish2292 nish2292 commented May 14, 2026

Summary

Fixes #1256 - `mlx_lm.server` crashes with `RuntimeError: There is no Stream(gpu, N) in current thread` when serving models from the generation thread. Affects all models via the server, especially sliding-window models (RotatingKVCache).

Why #1090 is insufficient

PR #1090 replaced mx.new_stream() with mx.new_thread_local_stream(), fixing the Stream(gpu, 0) variant but leaving an identical bug for Stream(gpu, 1):

  • ThreadLocalStream resolution registers the CommandEncoder only on the thread that triggers it
  • mx.eval() can encounter ops tagged with stream indices whose encoder doesn't exist on the current thread
  • The server's _generate() creates a separate local stream via mx.default_stream(), mismatching the module-level stream used by generate_step() and wired_limit()

Both the issue reporter and a community commenter confirmed this and recommended re-applying #1088's threading.local() approach.
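The underlying failure mode is visible with plain `threading.local()`: state created on one thread simply does not exist on another, which is why an encoder registered on the main thread is invisible to a worker. A minimal, self-contained demonstration (no mlx required; the attribute name `encoder` is illustrative):

```python
import threading

tls = threading.local()
tls.encoder = "registered-on-main"  # visible only to the thread that set it

result = {}

def worker():
    # The worker thread gets a fresh, empty threading.local namespace,
    # so the attribute set on the main thread does not exist here.
    result["has_encoder"] = hasattr(tls, "encoder")

t = threading.Thread(target=worker)
t.start()
t.join()
print(result["has_encoder"])  # → False: the main thread's state is invisible
```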

Fix

generation_stream becomes a threading.local()-backed factory function. Each thread gets its own mx.new_stream() with a locally registered CommandEncoder:

  • generate.py: 6 call sites updated to generation_stream()
  • server.py: Uses generation_stream() instead of local mx.default_stream() override
  • BatchGenerator.close(): Guards mx.synchronize() against cross-thread __del__ calls

generation_stream was never public API. External code that imported it as a variable would need to add ().
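A minimal sketch of the factory described above. The `_Stream` class and the caching logic are illustrative stand-ins, not the PR's exact code; in `generate.py` the factory would call `mx.new_stream()` instead:

```python
import threading

class _Stream:
    """Stand-in for an mlx stream; mlx_lm would call mx.new_stream() instead."""
    def __init__(self):
        self.owner = threading.get_ident()

_tls = threading.local()

def generation_stream():
    """Return this thread's stream, creating it lazily on first access.

    Creating the stream on the calling thread ensures any per-thread
    resources (the Metal CommandEncoder in mlx) are registered on the
    thread that will actually use them.
    """
    stream = getattr(_tls, "stream", None)
    if stream is None:
        stream = _Stream()  # real code: mx.new_stream()
        _tls.stream = stream
    return stream
```

Repeated calls on the same thread return the same cached stream, while each worker thread transparently creates its own, which is what makes the module-level variable safe to replace with a function call.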

Test plan

  • test_generation_stream_per_thread - each thread gets its own Stream
  • test_batch_sliding_window_threaded - BatchGenerator + RotatingKVCache on worker thread (exact crash scenario)
  • All existing tests pass (46/46)
  • Locally verified: Gemma 3 1B, Mistral 7B, Qwen2.5-Coder, GLM-4.7-Flash; thread pools, speculative decoding, concurrent requests, asyncio.to_thread

Related

…threads (ml-explore#1256)

Replace the module-level mx.new_thread_local_stream() with a
threading.local()-backed factory function. Each thread creates its
own mx.new_stream() on first access, ensuring the Metal
CommandEncoder is registered on the thread that will use it.

Changes:
- generate.py: generation_stream is now a function (6 call sites
  updated to generation_stream())
- server.py: use generation_stream() instead of local
  mx.default_stream() override
- BatchGenerator.close(): guard mx.synchronize() against cross-thread
  __del__ calls
- tests: add test_generation_stream_per_thread and
  test_batch_sliding_window_threaded

Fixes: ml-explore#1256
@nish2292 nish2292 force-pushed the feat/fix-sliding-window-server-crash branch from 5240ad4 to 60d1b5f on May 14, 2026 at 03:27
@danilopeixoto

+1

@nish2292 nish2292 marked this pull request as ready for review May 15, 2026 15:48


Successfully merging this pull request may close these issues.

mlx_lm.server crashes with 'There is no Stream(gpu, 1) in current thread' on sliding-window models (mlx-lm 0.31.3)
