
Fix XGrammar bitmask initialization and add null check for gen_config in generate method #4349

Open
windreamer wants to merge 3 commits into InternLM:main from windreamer:fix_guided_decoding_reuse
Conversation


@windreamer (Collaborator) commented on Feb 11, 2026

Motivation

This PR addresses two critical issues in LMDeploy's guided generation and batch inference functionality:

  1. XGrammar Bitmask Initialization Bug: The guided decoding mechanism initializes token bitmasks to control which tokens are allowed during constrained generation. Previously, the bitmask was initialized with zeros, which incorrectly blocked all tokens instead of allowing all tokens. This caused generation failures when certain sequences in a batch didn't have grammar constraints applied.

  2. Silent Failure on None gen_config: The batch_infer method accepts a list of GenerationConfig objects, but when None was passed for individual items, the code would fail or hang silently because downstream code accessed gen_config.max_new_tokens on a None value.

Modifications

  1. Guided Decoding Bitmask Fix (src/turbomind/generation/guided_decoding.cc):

    • Changed the bitmask initialization value from 0 to -1 (all bits set to 1 in two's complement representation for int32_t).
    • This ensures that when a sequence doesn't have an active grammar matcher, all tokens are permitted by default rather than none.
  2. Null GenerationConfig Guard (lmdeploy/serve/core/async_engine.py):

    • Added an explicit check to instantiate a default GenerationConfig() when gen_config is None.
    • This prevents downstream code from attempting to access attributes on a null object, which was causing the engine to hang silently.
  3. Regression Test (tests/test_lmdeploy/test_grammar.py):

    • Added test_mix_guided_matrix to verify that batch inference works correctly when mixing guided and unguided generation in the same batch.
    • The test ensures unguided sequences produce arbitrary text (not necessarily conforming to schema) while guided sequences strictly follow the JSON schema constraints.
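The bitmask fix above can be illustrated with a small Python sketch of the packed-bitmask semantics. This is not the actual fix (which lives in the C++ file src/turbomind/generation/guided_decoding.cc); the vocabulary size, word layout, and helper names here are illustrative, but they mirror why a 0-filled int32 bitmask blocks everything while a -1-filled one allows everything:

```python
VOCAB_SIZE = 100
BITS_PER_WORD = 32  # allow/deny bits are packed into int32 words

num_words = (VOCAB_SIZE + BITS_PER_WORD - 1) // BITS_PER_WORD

# Old behaviour: initializing with 0 clears every bit, so every token is blocked.
blocked = [0] * num_words

# Fixed behaviour: -1 is all-ones in two's complement int32, so every token is allowed.
allowed = [-1] * num_words

def is_token_allowed(bitmask, token_id):
    """Check the allow-bit for token_id, mirroring the packed int32 layout."""
    word = bitmask[token_id // BITS_PER_WORD] & 0xFFFFFFFF  # view as unsigned 32-bit
    return bool((word >> (token_id % BITS_PER_WORD)) & 1)
```

With the old zero initialization, a sequence that had no active grammar matcher would see every bit cleared and sampling would have no legal token; initializing with -1 makes "no constraint" mean "everything allowed" by default.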
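The None-guard for gen_config can be sketched as follows. This is a simplified stand-in, not the actual code in lmdeploy/serve/core/async_engine.py: the dataclass, its default value, and the helper name are illustrative, but the pattern is the same, so downstream attribute access such as gen_config.max_new_tokens never hits None:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class GenerationConfig:
    """Illustrative stand-in for lmdeploy's GenerationConfig."""
    max_new_tokens: int = 512

def normalize_gen_configs(prompts: List[str],
                          gen_config: Optional[List[Optional[GenerationConfig]]]):
    """Substitute a default GenerationConfig wherever None was passed,
    so later attribute access cannot fail on a None entry."""
    if gen_config is None:
        gen_config = [None] * len(prompts)
    return [cfg if cfg is not None else GenerationConfig() for cfg in gen_config]
```

A caller can then mix explicit configs and None freely in one batch, e.g. `normalize_gen_configs(["a", "b"], [None, GenerationConfig(max_new_tokens=8)])` yields a default config for the first prompt and the explicit one for the second.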

Copilot AI left a comment
Pull request overview

This pull request fixes a bug where grammar constraints from structured output requests incorrectly persist and apply to subsequent non-structured output requests when ModelRequest instances are reused. The fix introduces a clearGrammar() method to explicitly reset the grammar state after each inference session completes.

Changes:

  • Added clearGrammar() C++ method to ModelRequest class for resetting grammar state
  • Added Python binding clear_grammar() to expose the cleanup functionality
  • Called clear_grammar() in the finally block of async_stream_infer() to ensure cleanup

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Changed files:

  • src/turbomind/engine/model_request.h: declares the new clearGrammar() method
  • src/turbomind/engine/model_request.cc: implements clearGrammar() to reset the grammar_ shared_ptr
  • src/turbomind/python/bind.cpp: adds the Python binding for clear_grammar with appropriate GIL handling
  • lmdeploy/turbomind/turbomind.py: calls clear_grammar() in the finally block after inference completes
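The cleanup pattern described above can be sketched in pure Python. This is a simplified stand-in: the real ModelRequest is a C++ class exposed via pybind11 and the real call site is the async_stream_infer coroutine, so the class body, the synchronous generator, and the "json-schema" grammar value here are illustrative:

```python
class ModelRequest:
    """Simplified stand-in for the C++ ModelRequest exposed to Python."""

    def __init__(self):
        self.grammar = None  # mirrors the grammar_ shared_ptr

    def set_grammar(self, grammar):
        self.grammar = grammar

    def clear_grammar(self):
        # Reset grammar state so a reused request does not carry
        # constraints from a structured request into the next one.
        self.grammar = None

def stream_infer(request, grammar=None):
    """Sketch of the inference loop (the real one is async)."""
    if grammar is not None:
        request.set_grammar(grammar)
    try:
        yield "token"  # would stream generated tokens here
    finally:
        # Runs on normal completion, error, or cancellation, so the
        # reused ModelRequest always starts the next session clean.
        request.clear_grammar()
```

Putting the clear_grammar() call in finally is the key design choice: even if the client disconnects mid-stream and the generator is closed early, the grammar state is still reset before the request object is reused.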


Copilot AI left a comment

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

@windreamer force-pushed the fix_guided_decoding_reuse branch 2 times, most recently from f606bb6 to f81c224, on February 12, 2026 at 13:33
@windreamer requested a review from Copilot on February 12, 2026 at 13:34
Copilot AI left a comment

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.


@windreamer force-pushed the fix_guided_decoding_reuse branch from de1e35e to f74ca9b on February 13, 2026 at 09:41
@windreamer changed the title from "fix: add clear_grammar to remove grammar from reused model_request" to "Fix XGrammar bitmask initialization and add null check for gen_config in generate method" on Feb 13, 2026
Copilot AI left a comment

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

