Fix XGrammar bitmask initialization and add null check for gen_config in generate method (#4349)

windreamer wants to merge 3 commits into InternLM:main
Conversation
Pull request overview
This pull request fixes a bug where grammar constraints from structured output requests incorrectly persist and apply to subsequent non-structured output requests when ModelRequest instances are reused. The fix introduces a clearGrammar() method to explicitly reset the grammar state after each inference session completes.
Changes:
- Added a `clearGrammar()` C++ method to the `ModelRequest` class for resetting grammar state
- Added a Python binding `clear_grammar()` to expose the cleanup functionality
- Called `clear_grammar()` in the `finally` block of `async_stream_infer()` to ensure cleanup
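The cleanup flow above can be sketched in Python. The names below are simplified stand-ins for the actual `ModelRequest` binding and `async_stream_infer()`, not the real lmdeploy code:

```python
class ModelRequest:
    """Toy stand-in for the C++ ModelRequest binding (hypothetical)."""

    def __init__(self):
        self.grammar = None

    def set_grammar(self, grammar):
        self.grammar = grammar

    def clear_grammar(self):
        # Reset grammar state so a reused request cannot carry constraints
        # from a previous structured-output session.
        self.grammar = None


def stream_infer(request, grammar=None):
    """Simplified analogue of async_stream_infer()."""
    try:
        if grammar is not None:
            request.set_grammar(grammar)
        return "generated text"  # placeholder for the real decode loop
    finally:
        # Runs on success and on error alike, so grammar never leaks
        # into the next inference session on the same request object.
        request.clear_grammar()
```

Putting the reset in `finally` rather than at the end of the happy path is what guarantees cleanup even when inference raises.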
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/turbomind/engine/model_request.h | Declares the new clearGrammar() method |
| src/turbomind/engine/model_request.cc | Implements clearGrammar() to reset the grammar_ shared_ptr |
| src/turbomind/python/bind.cpp | Adds Python binding for clear_grammar with appropriate GIL handling |
| lmdeploy/turbomind/turbomind.py | Calls clear_grammar() in finally block after inference completes |
Motivation
This PR addresses two critical issues in LMDeploy's guided generation and batch inference functionality:
XGrammar Bitmask Initialization Bug: The guided decoding mechanism initializes token bitmasks to control which tokens are allowed during constrained generation. Previously, the bitmask was initialized with zeros, which incorrectly blocked all tokens instead of allowing all tokens. This caused generation failures when certain sequences in a batch didn't have grammar constraints applied.
Silent Failure on None gen_config: The `batch_infer` method accepts a list of `GenerationConfig` objects, but when `None` was passed for individual items, the code would silently fail or hang due to an unhandled null dereference when accessing `gen_config.max_new_tokens`.
Modifications
- Guided Decoding Bitmask Fix (`src/turbomind/generation/guided_decoding.cc`): changed the bitmask initialization from `0` to `-1` (all bits set to 1 in the two's complement representation of `int32_t`), so all tokens are allowed by default.
- Null GenerationConfig Guard (`lmdeploy/serve/core/async_engine.py`): fall back to a default `GenerationConfig()` when `gen_config` is `None`.
- Regression Test (`tests/test_lmdeploy/test_grammar.py`): added `test_mix_guided_matrix` to verify that batch inference works correctly when mixing guided and unguided generation in the same batch.