fix: make generation_stream per-thread to fix server crash on worker threads #1275
Open
nish2292 wants to merge 1 commit into
Conversation
…threads (ml-explore#1256)

Replace the module-level mx.new_thread_local_stream() with a threading.local()-backed factory function. Each thread creates its own mx.new_stream() on first access, ensuring the Metal CommandEncoder is registered on the thread that will use it.

Changes:
- generate.py: generation_stream is now a function (6 call sites updated to generation_stream())
- server.py: use generation_stream() instead of local mx.default_stream() override
- BatchGenerator.close(): guard mx.synchronize() against cross-thread __del__ calls
- tests: add test_generation_stream_per_thread and test_batch_sliding_window_threaded

Fixes: ml-explore#1256
5240ad4 to 60d1b5f
Summary
Fixes #1256: `mlx_lm.server` crashes with `RuntimeError: There is no Stream(gpu, N) in current thread` when serving models from the generation thread. Affects all models via the server, especially sliding-window models (`RotatingKVCache`).

Why #1090 is insufficient
PR #1090 replaced `mx.new_stream()` with `mx.new_thread_local_stream()`, fixing the `Stream(gpu, 0)` variant but leaving an identical bug for `Stream(gpu, 1)`:

- `ThreadLocalStream` resolution registers the `CommandEncoder` only on the thread that triggers it
- `mx.eval()` can encounter ops tagged with stream indices whose encoder doesn't exist on the current thread
- `_generate()` creates a separate local stream via `mx.default_stream()`, mismatching the module-level stream used by `generate_step()` and `wired_limit()`

Both the issue reporter and a community commenter confirmed this and recommended re-applying #1088's `threading.local()` approach (see the illustrative sketch below).
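For illustration only — this is not code from this PR — here is a minimal sketch of the cross-thread pattern described above, assuming standard MLX calls (`mx.new_stream`, `mx.stream`, `mx.eval`); the function name and shapes are hypothetical:

```python
import threading

import mlx.core as mx

# Module-level stream, created on whichever thread imports the module
# (typically the main thread). Pre-#1090 this was mx.new_stream(); #1090
# switched it to mx.new_thread_local_stream(), which still ties the Metal
# CommandEncoder to the thread that first resolves the stream.
generation_stream = mx.new_stream(mx.gpu)

def handle_request():
    # The server runs generation on a worker thread. Evaluating work tagged
    # with a stream whose encoder lives on another thread can raise:
    #   RuntimeError: There is no Stream(gpu, N) in current thread
    with mx.stream(generation_stream):
        mx.eval(mx.zeros((4, 4)) + 1)

worker = threading.Thread(target=handle_request)
worker.start()
worker.join()
```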
Fix

`generation_stream` becomes a `threading.local()`-backed factory function (sketched below). Each thread gets its own `mx.new_stream()` with a locally registered `CommandEncoder`:

- generate.py: 6 call sites updated to `generation_stream()`
- server.py: uses `generation_stream()` instead of the local `mx.default_stream()` override
- `BatchGenerator.close()`: guards `mx.synchronize()` against cross-thread `__del__` calls

`generation_stream` was never public API; external code that imported it as a variable would need to add `()`.
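A minimal sketch of the factory shape described above, assuming only `threading.local()` and the `mx.new_stream()` call named in this PR; it is illustrative, not the exact code merged here:

```python
import threading

import mlx.core as mx

_thread_local = threading.local()

def generation_stream() -> mx.Stream:
    """Return a GPU stream owned by the calling thread.

    The stream is created lazily on first use from each thread, so the
    Metal CommandEncoder is registered on the thread that will actually
    evaluate work on it.
    """
    stream = getattr(_thread_local, "stream", None)
    if stream is None:
        stream = mx.new_stream(mx.gpu)
        _thread_local.stream = stream
    return stream
```

Call sites then wrap work in `with mx.stream(generation_stream()):` rather than referencing a module-level stream object.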
Test plan

- `test_generation_stream_per_thread`: each thread gets its own Stream
- `test_batch_sliding_window_threaded`: BatchGenerator + RotatingKVCache on a worker thread (the exact crash scenario)

Related