bugfix: Fix prompt/context length limits and NPU group initialization.#1513
Conversation
There was a problem hiding this comment.
Code Review
This pull request refactors prompt length validation across the LLMMaster, RecMaster, and VLMMaster classes, and ensures max_tokens_per_batch is correctly propagated in the Master configuration. Additionally, it updates CollectiveCommunicator to support specific host resolution for NPU-based MoE tensor parallel groups. The review identifies recurring issues in the new validation logic, specifically signed/unsigned comparison violations of the style guide (Rule 79) and off-by-one errors in the prompt length checks that could lead to incorrect request rejection or insufficient space for token generation.
| int32_t prompt_token_limit = max_context_len; | ||
| if (!options_.enable_chunked_prefill()) { | ||
| max_context_len = | ||
| std::min(max_context_len, options_.max_tokens_per_batch()); | ||
| prompt_token_limit = | ||
| std::min(prompt_token_limit, options_.max_tokens_per_batch()); | ||
| } | ||
| if (local_prompt_tokens.size() >= max_context_len) { | ||
| if (local_prompt_tokens.size() >= prompt_token_limit) { |
There was a problem hiding this comment.
The prompt length check has two issues:
- Signed/Unsigned Comparison: Comparing
local_prompt_tokens.size()(unsigned) withprompt_token_limit(signed) violates the repository style guide (Rule 79), which requires explicitstatic_castfor type conversions. - Off-by-one Error: When
enable_chunked_prefillis false, the prompt must fit withinmax_tokens_per_batch. A prompt of size exactly equal to the batch size is valid for prefill, but the>=check incorrectly rejects it. Conversely, for the model'smax_position_embeddings, the check should be strict (size < max_context_len) because there must be room for at least one generated token.
Adjusting the limit and using a > comparison correctly handles both constraints.
| int32_t prompt_token_limit = max_context_len; | |
| if (!options_.enable_chunked_prefill()) { | |
| max_context_len = | |
| std::min(max_context_len, options_.max_tokens_per_batch()); | |
| prompt_token_limit = | |
| std::min(prompt_token_limit, options_.max_tokens_per_batch()); | |
| } | |
| if (local_prompt_tokens.size() >= max_context_len) { | |
| if (local_prompt_tokens.size() >= prompt_token_limit) { | |
| int32_t prompt_token_limit = max_context_len - 1; | |
| if (!options_.enable_chunked_prefill()) { | |
| prompt_token_limit = | |
| std::min(prompt_token_limit, options_.max_tokens_per_batch()); | |
| } | |
| if (static_cast<int32_t>(local_prompt_tokens.size()) > prompt_token_limit) { |
References
- Use static_cast for all type conversions. Never use C-style casts. (link)
| int32_t prompt_token_limit = max_context_len; | ||
| if (!options_.enable_chunked_prefill()) { | ||
| int32_t max_tokens_per_req = options_.max_tokens_per_batch(); | ||
| if (rec_type_ == RecType::kLlmRec && is_rec_multi_round_mode()) { | ||
| CHECK_GT(options_.max_seqs_per_batch(), 0) | ||
| << "max_seqs_per_batch must be greater than 0 in multi-round mode"; | ||
| max_tokens_per_req /= options_.max_seqs_per_batch(); | ||
| } | ||
| max_context_len = std::min(max_context_len, max_tokens_per_req); | ||
| prompt_token_limit = std::min(prompt_token_limit, max_tokens_per_req); | ||
| } | ||
| if (prompt_tokens.size() >= max_context_len) { | ||
| if (prompt_tokens.size() >= prompt_token_limit) { |
There was a problem hiding this comment.
The prompt length check has two issues:
- Signed/Unsigned Comparison: Comparing
prompt_tokens.size()(unsigned) withprompt_token_limit(signed) violates the repository style guide (Rule 79). - Off-by-one Error: The
>=check incorrectly rejects prompts that exactly fill the batch/request limit when chunked prefill is disabled. For the model limit, the check should be strict (size < max_context_len), but for the batch limit, it should be inclusive (size <= max_tokens_per_req).
| int32_t prompt_token_limit = max_context_len; | |
| if (!options_.enable_chunked_prefill()) { | |
| int32_t max_tokens_per_req = options_.max_tokens_per_batch(); | |
| if (rec_type_ == RecType::kLlmRec && is_rec_multi_round_mode()) { | |
| CHECK_GT(options_.max_seqs_per_batch(), 0) | |
| << "max_seqs_per_batch must be greater than 0 in multi-round mode"; | |
| max_tokens_per_req /= options_.max_seqs_per_batch(); | |
| } | |
| max_context_len = std::min(max_context_len, max_tokens_per_req); | |
| prompt_token_limit = std::min(prompt_token_limit, max_tokens_per_req); | |
| } | |
| if (prompt_tokens.size() >= max_context_len) { | |
| if (prompt_tokens.size() >= prompt_token_limit) { | |
| int32_t prompt_token_limit = max_context_len - 1; | |
| if (!options_.enable_chunked_prefill()) { | |
| int32_t max_tokens_per_req = options_.max_tokens_per_batch(); | |
| if (rec_type_ == RecType::kLlmRec && is_rec_multi_round_mode()) { | |
| CHECK_GT(options_.max_seqs_per_batch(), 0) | |
| << "max_seqs_per_batch must be greater than 0 in multi-round mode"; | |
| max_tokens_per_req /= options_.max_seqs_per_batch(); | |
| } | |
| prompt_token_limit = std::min(prompt_token_limit, max_tokens_per_req); | |
| } | |
| if (static_cast<int32_t>(prompt_tokens.size()) > prompt_token_limit) { |
References
- Use static_cast for all type conversions. Never use C-style casts. (link)
| int32_t prompt_token_limit = max_context_len; | ||
| if (!options_.enable_chunked_prefill()) { | ||
| max_context_len = | ||
| std::min(max_context_len, options_.max_tokens_per_batch()); | ||
| prompt_token_limit = | ||
| std::min(prompt_token_limit, options_.max_tokens_per_batch()); | ||
| } | ||
| if (prompt_tokens.size() >= max_context_len) { | ||
| if (prompt_tokens.size() >= prompt_token_limit) { |
There was a problem hiding this comment.
The prompt length check has two issues:
- Signed/Unsigned Comparison: Comparing
prompt_tokens.size()(unsigned) withprompt_token_limit(signed) violates the repository style guide (Rule 79). - Off-by-one Error: The
>=check incorrectly rejects prompts that exactly fill the batch limit when chunked prefill is disabled. Adjusting the limit and using a>comparison correctly handles both the model context limit and the batch size limit.
| int32_t prompt_token_limit = max_context_len; | |
| if (!options_.enable_chunked_prefill()) { | |
| max_context_len = | |
| std::min(max_context_len, options_.max_tokens_per_batch()); | |
| prompt_token_limit = | |
| std::min(prompt_token_limit, options_.max_tokens_per_batch()); | |
| } | |
| if (prompt_tokens.size() >= max_context_len) { | |
| if (prompt_tokens.size() >= prompt_token_limit) { | |
| int32_t prompt_token_limit = max_context_len - 1; | |
| if (!options_.enable_chunked_prefill()) { | |
| prompt_token_limit = | |
| std::min(prompt_token_limit, options_.max_tokens_per_batch()); | |
| } | |
| if (static_cast<int32_t>(prompt_tokens.size()) > prompt_token_limit) { |
References
- Use static_cast for all type conversions. Never use C-style casts. (link)
71e14f2
into
jd-opensource:preview/deepseek-v4-npu
No description provided.