bugfix: Fix prompt/context length limits and NPU group initialization. by sanlio36 · Pull Request #1513 · jd-opensource/xllm

sanlio36 · 2026-05-21T16:14:39Z

No description provided.

gemini-code-assist

Code Review

This pull request refactors prompt length validation across the LLMMaster, RecMaster, and VLMMaster classes, and ensures max_tokens_per_batch is correctly propagated in the Master configuration. Additionally, it updates CollectiveCommunicator to support specific host resolution for NPU-based MoE tensor parallel groups. The review identifies recurring issues in the new validation logic, specifically signed/unsigned comparison violations of the style guide (Rule 79) and off-by-one errors in the prompt length checks that could lead to incorrect request rejection or insufficient space for token generation.

gemini-code-assist · 2026-05-21T16:17:03Z

+  int32_t prompt_token_limit = max_context_len;
  if (!options_.enable_chunked_prefill()) {
-    max_context_len =
-        std::min(max_context_len, options_.max_tokens_per_batch());
+    prompt_token_limit =
+        std::min(prompt_token_limit, options_.max_tokens_per_batch());
  }
-  if (local_prompt_tokens.size() >= max_context_len) {
+  if (local_prompt_tokens.size() >= prompt_token_limit) {


The prompt length check has two issues:

Signed/Unsigned Comparison: Comparing local_prompt_tokens.size() (unsigned) with prompt_token_limit (signed) violates the repository style guide (Rule 79), which requires explicit static_cast for type conversions.

Off-by-one Error: When enable_chunked_prefill is false, the prompt must fit within max_tokens_per_batch. A prompt of size exactly equal to the batch size is valid for prefill, but the >= check incorrectly rejects it. Conversely, for the model's max_position_embeddings, the check should be strict (size < max_context_len) because there must be room for at least one generated token.

Adjusting the limit and using a > comparison correctly handles both constraints.
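As a minimal sketch of the combined check described above (not the actual xllm code): `prompt_too_long` and its parameters are hypothetical names standing in for the values read from `options_` and the model config. It assumes prompt lengths fit in `int32_t`, which the narrowing cast relies on.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not part of xllm: returns true when the prompt
// should be rejected under the two limits discussed in the review.
bool prompt_too_long(size_t prompt_len,
                     int32_t max_context_len,
                     int32_t max_tokens_per_batch,
                     bool enable_chunked_prefill) {
  // Model limit is strict: leave room for at least one generated token.
  int32_t prompt_token_limit = max_context_len - 1;
  if (!enable_chunked_prefill) {
    // Batch limit is inclusive: a prompt may fill the batch exactly.
    prompt_token_limit = std::min(prompt_token_limit, max_tokens_per_batch);
  }
  // Explicit cast so both operands are signed (assumes the length fits).
  return static_cast<int32_t>(prompt_len) > prompt_token_limit;
}
```

With `max_context_len = 16` and `max_tokens_per_batch = 8`, a prompt of 8 tokens is accepted when chunked prefill is off, while a prompt of 16 tokens is rejected in either mode.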

Suggested change

-  int32_t prompt_token_limit = max_context_len;
-  if (!options_.enable_chunked_prefill()) {
-    prompt_token_limit =
-        std::min(prompt_token_limit, options_.max_tokens_per_batch());
-  }
-  if (local_prompt_tokens.size() >= prompt_token_limit) {
+  int32_t prompt_token_limit = max_context_len - 1;
+  if (!options_.enable_chunked_prefill()) {
+    prompt_token_limit =
+        std::min(prompt_token_limit, options_.max_tokens_per_batch());
+  }
+  if (static_cast<int32_t>(local_prompt_tokens.size()) > prompt_token_limit) {

References

Use static_cast for all type conversions. Never use C-style casts.

gemini-code-assist · 2026-05-21T16:17:03Z

+  int32_t prompt_token_limit = max_context_len;
  if (!options_.enable_chunked_prefill()) {
    int32_t max_tokens_per_req = options_.max_tokens_per_batch();
    if (rec_type_ == RecType::kLlmRec && is_rec_multi_round_mode()) {
      CHECK_GT(options_.max_seqs_per_batch(), 0)
          << "max_seqs_per_batch must be greater than 0 in multi-round mode";
      max_tokens_per_req /= options_.max_seqs_per_batch();
    }
-    max_context_len = std::min(max_context_len, max_tokens_per_req);
+    prompt_token_limit = std::min(prompt_token_limit, max_tokens_per_req);
  }
-  if (prompt_tokens.size() >= max_context_len) {
+  if (prompt_tokens.size() >= prompt_token_limit) {


The prompt length check has two issues:

Signed/Unsigned Comparison: Comparing prompt_tokens.size() (unsigned) with prompt_token_limit (signed) violates the repository style guide (Rule 79).

Off-by-one Error: The >= check incorrectly rejects prompts that exactly fill the batch/request limit when chunked prefill is disabled. For the model limit, the check should be strict (size < max_context_len), but for the batch limit, it should be inclusive (size <= max_tokens_per_req).
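The per-request limit in multi-round mode can be sketched as follows. This is an illustration under assumptions, not the RecMaster implementation: `LimitOptions` and `compute_prompt_token_limit` are hypothetical names, `multi_round` stands in for `rec_type_ == RecType::kLlmRec && is_rec_multi_round_mode()`, and the sketch skips the division for a non-positive `max_seqs_per_batch` where the PR code aborts via `CHECK_GT`.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical stand-in for the option values the diff reads from options_.
struct LimitOptions {
  int32_t max_tokens_per_batch;
  int32_t max_seqs_per_batch;
  bool enable_chunked_prefill;
  bool multi_round;  // stands in for the kLlmRec multi-round condition
};

// Returns the largest prompt length that should still be accepted.
int32_t compute_prompt_token_limit(int32_t max_context_len,
                                   const LimitOptions& opts) {
  // Strict model limit: one position stays free for generation.
  int32_t limit = max_context_len - 1;
  if (!opts.enable_chunked_prefill) {
    int32_t max_tokens_per_req = opts.max_tokens_per_batch;
    // In multi-round mode the batch budget is split across sequences.
    // The PR code enforces max_seqs_per_batch > 0 with CHECK_GT instead.
    if (opts.multi_round && opts.max_seqs_per_batch > 0) {
      max_tokens_per_req /= opts.max_seqs_per_batch;
    }
    // Inclusive per-request limit.
    limit = std::min(limit, max_tokens_per_req);
  }
  return limit;
}
```

A caller would then reject a prompt only when its length, cast to `int32_t`, is greater than the returned limit.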

Suggested change

-  int32_t prompt_token_limit = max_context_len;
-  if (!options_.enable_chunked_prefill()) {
-    int32_t max_tokens_per_req = options_.max_tokens_per_batch();
-    if (rec_type_ == RecType::kLlmRec && is_rec_multi_round_mode()) {
-      CHECK_GT(options_.max_seqs_per_batch(), 0)
-          << "max_seqs_per_batch must be greater than 0 in multi-round mode";
-      max_tokens_per_req /= options_.max_seqs_per_batch();
-    }
-    prompt_token_limit = std::min(prompt_token_limit, max_tokens_per_req);
-  }
-  if (prompt_tokens.size() >= prompt_token_limit) {
+  int32_t prompt_token_limit = max_context_len - 1;
+  if (!options_.enable_chunked_prefill()) {
+    int32_t max_tokens_per_req = options_.max_tokens_per_batch();
+    if (rec_type_ == RecType::kLlmRec && is_rec_multi_round_mode()) {
+      CHECK_GT(options_.max_seqs_per_batch(), 0)
+          << "max_seqs_per_batch must be greater than 0 in multi-round mode";
+      max_tokens_per_req /= options_.max_seqs_per_batch();
+    }
+    prompt_token_limit = std::min(prompt_token_limit, max_tokens_per_req);
+  }
+  if (static_cast<int32_t>(prompt_tokens.size()) > prompt_token_limit) {

References

Use static_cast for all type conversions. Never use C-style casts.

gemini-code-assist · 2026-05-21T16:17:04Z

+  int32_t prompt_token_limit = max_context_len;
  if (!options_.enable_chunked_prefill()) {
-    max_context_len =
-        std::min(max_context_len, options_.max_tokens_per_batch());
+    prompt_token_limit =
+        std::min(prompt_token_limit, options_.max_tokens_per_batch());
  }
-  if (prompt_tokens.size() >= max_context_len) {
+  if (prompt_tokens.size() >= prompt_token_limit) {


The prompt length check has two issues:

Signed/Unsigned Comparison: Comparing prompt_tokens.size() (unsigned) with prompt_token_limit (signed) violates the repository style guide (Rule 79).

Off-by-one Error: The >= check incorrectly rejects prompts that exactly fill the batch limit when chunked prefill is disabled. Adjusting the limit and using a > comparison correctly handles both the model context limit and the batch size limit.
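To illustrate why the mixed signed/unsigned comparison matters beyond style, here is a small sketch with hypothetical function names. It writes out the conversion that the compiler otherwise applies implicitly, assuming `size_t` is at least as wide as `int32_t`: a negative limit becomes a very large unsigned value, so the check silently never fires.

```cpp
#include <cstddef>
#include <cstdint>

// Reproduces what an implicit size_t-vs-int32_t comparison does:
// the signed limit is converted to unsigned before comparing.
bool exceeds_unsigned(size_t prompt_len, int32_t limit) {
  return prompt_len > static_cast<size_t>(limit);
}

// The form suggested in the review: cast the size to the signed type.
// Assumes prompt_len fits in int32_t.
bool exceeds_signed(size_t prompt_len, int32_t limit) {
  return static_cast<int32_t>(prompt_len) > limit;
}
```

For a limit of `-1` (for example from a misconfigured option), `exceeds_unsigned(10, -1)` is false while `exceeds_signed(10, -1)` is true. For non-negative limits both forms agree.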

Suggested change

-  int32_t prompt_token_limit = max_context_len;
-  if (!options_.enable_chunked_prefill()) {
-    prompt_token_limit =
-        std::min(prompt_token_limit, options_.max_tokens_per_batch());
-  }
-  if (prompt_tokens.size() >= prompt_token_limit) {
+  int32_t prompt_token_limit = max_context_len - 1;
+  if (!options_.enable_chunked_prefill()) {
+    prompt_token_limit =
+        std::min(prompt_token_limit, options_.max_tokens_per_batch());
+  }
+  if (static_cast<int32_t>(prompt_tokens.size()) > prompt_token_limit) {

References

Use static_cast for all type conversions. Never use C-style casts.


sanlio36 added 3 commits May 22, 2026 00:13

bugfix: separate prompt and context length limits.

953af44

bugfix: fix npu torch moe_tp group startup host.

b0ca6db

bugfix: fix master init about max_tokens_per_batch.

b717598

sanlio36 requested review from DongheJin, JimHsiung, RobbieLeung, XuZhang99, liutongxuan, walsonyang and yq33victor as code owners May 21, 2026 16:14

gemini-code-assist Bot reviewed May 21, 2026


sanlio36 merged commit 71e14f2 into jd-opensource:preview/deepseek-v4-npu May 21, 2026
6 checks passed

sanlio36 added a commit to sanlio36/xllm that referenced this pull request May 23, 2026

bugfix: Fix prompt/context length limits and NPU group initialization. (jd-opensource#1513)

b83099e

sanlio36 added a commit to sanlio36/xllm that referenced this pull request May 23, 2026

bugfix: Fix prompt/context length limits and NPU group initialization. (jd-opensource#1513)

53a5c8a

sanlio36 added a commit to sanlio36/xllm that referenced this pull request May 23, 2026

bugfix: Fix prompt/context length limits and NPU group initialization. (jd-opensource#1513)

8f1d7b2

sanlio36 merged 3 commits into jd-opensource:preview/deepseek-v4-npu from sanlio36:dev_dsv4_merge_main2
