Skip to content

bugfix: Fix prompt/context length limits and NPU group initialization.#1513

Merged
sanlio36 merged 3 commits into
jd-opensource:preview/deepseek-v4-npufrom
sanlio36:dev_dsv4_merge_main2
May 21, 2026
Merged

bugfix: Fix prompt/context length limits and NPU group initialization.#1513
sanlio36 merged 3 commits into
jd-opensource:preview/deepseek-v4-npufrom
sanlio36:dev_dsv4_merge_main2

Conversation

@sanlio36
Copy link
Copy Markdown
Collaborator

No description provided.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors prompt length validation across the LLMMaster, RecMaster, and VLMMaster classes, and ensures max_tokens_per_batch is correctly propagated in the Master configuration. Additionally, it updates CollectiveCommunicator to support specific host resolution for NPU-based MoE tensor parallel groups. The review identifies recurring issues in the new validation logic, specifically signed/unsigned comparison violations of the style guide (Rule 79) and off-by-one errors in the prompt length checks that could lead to incorrect request rejection or insufficient space for token generation.

Comment on lines +317 to +322
int32_t prompt_token_limit = max_context_len;
if (!options_.enable_chunked_prefill()) {
max_context_len =
std::min(max_context_len, options_.max_tokens_per_batch());
prompt_token_limit =
std::min(prompt_token_limit, options_.max_tokens_per_batch());
}
if (local_prompt_tokens.size() >= max_context_len) {
if (local_prompt_tokens.size() >= prompt_token_limit) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The prompt length check has two issues:

  1. Signed/Unsigned Comparison: Comparing local_prompt_tokens.size() (unsigned) with prompt_token_limit (signed) violates the repository style guide (Rule 79), which requires explicit static_cast for type conversions.
  2. Off-by-one Error: When enable_chunked_prefill is false, the prompt must fit within max_tokens_per_batch. A prompt of size exactly equal to the batch size is valid for prefill, but the >= check incorrectly rejects it. Conversely, for the model's max_position_embeddings, the check should be strict (size < max_context_len) because there must be room for at least one generated token.

Adjusting the limit and using a > comparison correctly handles both constraints.

Suggested change
int32_t prompt_token_limit = max_context_len;
if (!options_.enable_chunked_prefill()) {
max_context_len =
std::min(max_context_len, options_.max_tokens_per_batch());
prompt_token_limit =
std::min(prompt_token_limit, options_.max_tokens_per_batch());
}
if (local_prompt_tokens.size() >= max_context_len) {
if (local_prompt_tokens.size() >= prompt_token_limit) {
int32_t prompt_token_limit = max_context_len - 1;
if (!options_.enable_chunked_prefill()) {
prompt_token_limit =
std::min(prompt_token_limit, options_.max_tokens_per_batch());
}
if (static_cast<int32_t>(local_prompt_tokens.size()) > prompt_token_limit) {
References
  1. Use static_cast for all type conversions. Never use C-style casts. (link)

Comment on lines +753 to +763
int32_t prompt_token_limit = max_context_len;
if (!options_.enable_chunked_prefill()) {
int32_t max_tokens_per_req = options_.max_tokens_per_batch();
if (rec_type_ == RecType::kLlmRec && is_rec_multi_round_mode()) {
CHECK_GT(options_.max_seqs_per_batch(), 0)
<< "max_seqs_per_batch must be greater than 0 in multi-round mode";
max_tokens_per_req /= options_.max_seqs_per_batch();
}
max_context_len = std::min(max_context_len, max_tokens_per_req);
prompt_token_limit = std::min(prompt_token_limit, max_tokens_per_req);
}
if (prompt_tokens.size() >= max_context_len) {
if (prompt_tokens.size() >= prompt_token_limit) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The prompt length check has two issues:

  1. Signed/Unsigned Comparison: Comparing prompt_tokens.size() (unsigned) with prompt_token_limit (signed) violates the repository style guide (Rule 79).
  2. Off-by-one Error: The >= check incorrectly rejects prompts that exactly fill the batch/request limit when chunked prefill is disabled. For the model limit, the check should be strict (size < max_context_len), but for the batch limit, it should be inclusive (size <= max_tokens_per_req).
Suggested change
int32_t prompt_token_limit = max_context_len;
if (!options_.enable_chunked_prefill()) {
int32_t max_tokens_per_req = options_.max_tokens_per_batch();
if (rec_type_ == RecType::kLlmRec && is_rec_multi_round_mode()) {
CHECK_GT(options_.max_seqs_per_batch(), 0)
<< "max_seqs_per_batch must be greater than 0 in multi-round mode";
max_tokens_per_req /= options_.max_seqs_per_batch();
}
max_context_len = std::min(max_context_len, max_tokens_per_req);
prompt_token_limit = std::min(prompt_token_limit, max_tokens_per_req);
}
if (prompt_tokens.size() >= max_context_len) {
if (prompt_tokens.size() >= prompt_token_limit) {
int32_t prompt_token_limit = max_context_len - 1;
if (!options_.enable_chunked_prefill()) {
int32_t max_tokens_per_req = options_.max_tokens_per_batch();
if (rec_type_ == RecType::kLlmRec && is_rec_multi_round_mode()) {
CHECK_GT(options_.max_seqs_per_batch(), 0)
<< "max_seqs_per_batch must be greater than 0 in multi-round mode";
max_tokens_per_req /= options_.max_seqs_per_batch();
}
prompt_token_limit = std::min(prompt_token_limit, max_tokens_per_req);
}
if (static_cast<int32_t>(prompt_tokens.size()) > prompt_token_limit) {
References
  1. Use static_cast for all type conversions. Never use C-style casts. (link)

Comment on lines +334 to +339
int32_t prompt_token_limit = max_context_len;
if (!options_.enable_chunked_prefill()) {
max_context_len =
std::min(max_context_len, options_.max_tokens_per_batch());
prompt_token_limit =
std::min(prompt_token_limit, options_.max_tokens_per_batch());
}
if (prompt_tokens.size() >= max_context_len) {
if (prompt_tokens.size() >= prompt_token_limit) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The prompt length check has two issues:

  1. Signed/Unsigned Comparison: Comparing prompt_tokens.size() (unsigned) with prompt_token_limit (signed) violates the repository style guide (Rule 79).
  2. Off-by-one Error: The >= check incorrectly rejects prompts that exactly fill the batch limit when chunked prefill is disabled. Adjusting the limit and using a > comparison correctly handles both the model context limit and the batch size limit.
Suggested change
int32_t prompt_token_limit = max_context_len;
if (!options_.enable_chunked_prefill()) {
max_context_len =
std::min(max_context_len, options_.max_tokens_per_batch());
prompt_token_limit =
std::min(prompt_token_limit, options_.max_tokens_per_batch());
}
if (prompt_tokens.size() >= max_context_len) {
if (prompt_tokens.size() >= prompt_token_limit) {
int32_t prompt_token_limit = max_context_len - 1;
if (!options_.enable_chunked_prefill()) {
prompt_token_limit =
std::min(prompt_token_limit, options_.max_tokens_per_batch());
}
if (static_cast<int32_t>(prompt_tokens.size()) > prompt_token_limit) {
References
  1. Use static_cast for all type conversions. Never use C-style casts. (link)

@sanlio36 sanlio36 merged commit 71e14f2 into jd-opensource:preview/deepseek-v4-npu May 21, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant