bugfix: use max_concurrent_requests for single block and linear state allocation.#1496
bugfix: use max_concurrent_requests for single block and linear state allocation.#1496pjgao wants to merge 2 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request replaces the use of max_seqs_per_batch from the SchedulerConfig with a global flag FLAGS_max_concurrent_requests for KV cache capacity estimation and block manager allocation. Review feedback highlights that direct access to global flags violates Rule 10 of the repository style guide, which prefers configuration structs or constructor parameters over global state. It is recommended to add max_concurrent_requests to the SchedulerConfig or pass it via the Options struct to maintain encapsulation and follow established architectural patterns.
|
|
||
| kv_cache_cap.num_linear_state_blocks() = | ||
| ::xllm::SchedulerConfig::get_instance().max_seqs_per_batch() + 2; | ||
| kv_cache_cap.num_linear_state_blocks() = FLAGS_max_concurrent_requests + 2; |
There was a problem hiding this comment.
Directly accessing FLAGS_max_concurrent_requests violates the repository style guide (Rule 10), which prefers using configuration structs over global flags. Since max_seqs_per_batch is already managed via SchedulerConfig, max_concurrent_requests should also be added to SchedulerConfig and accessed via ::xllm::SchedulerConfig::get_instance().max_concurrent_requests(). This ensures consistency, centralizes configuration, and allows for proper validation during initialization. Additionally, using std::max(FLAGS_max_concurrent_requests, ::xllm::SchedulerConfig::get_instance().max_seqs_per_batch()) would be more robust to ensure sufficient blocks are reserved even if max_concurrent_requests is misconfigured to a value smaller than the batch size.
References
- Do not overuse FLAGS_ global variables. Prefer passing configuration through constructor parameters or config structs. Only use global flags for top-level, process-wide settings. (link)
| /*num_blocks=*/::xllm::SchedulerConfig::get_instance() | ||
| .max_seqs_per_batch() + | ||
| 2, | ||
| /*num_blocks=*/FLAGS_max_concurrent_requests + 2, |
There was a problem hiding this comment.
Directly accessing FLAGS_max_concurrent_requests here violates the style guide (Rule 10) and breaks encapsulation. This value should be passed through the BlockManagerPool::Options struct, following the pattern used for other configuration parameters like num_blocks and block_size. This makes the class easier to test and decouples it from global state.
References
- Do not overuse FLAGS_ global variables. Prefer passing configuration through constructor parameters or config structs. (link)
| "SCHEDULER OPTIONS", | ||
| {"max_tokens_per_batch", | ||
| "max_seqs_per_batch", | ||
| "max_concurrent_requests", |
|
|
||
| kv_cache_cap.num_linear_state_blocks() = | ||
| ::xllm::SchedulerConfig::get_instance().max_seqs_per_batch() + 2; | ||
| ::xllm::SchedulerConfig::get_instance().max_concurrent_requests() + 2; |
There was a problem hiding this comment.
vlm_engine.cpp 需要类似修改吗
There was a problem hiding this comment.
Ah right, the qwen3.5 VLM code hasn't been merged yet. Never mind — let's revisit this after the VLM PR lands.
404ba50 to
2989452
Compare
jd-opensource#1413) Co-authored-by: kangmeng3 <kangmeng3@jd.com>
Co-authored-by: pjgao <gaopengju3@huawei.com>
2989452 to
446c057
Compare
Summary
Cherry-pick two bugfixes from
preview/qwen3.5-qwen3.6branch tomain.Cherry-picked Commits
9b64c30
bugfix: change single block manager blocks to max concurrent requests. (#1413)005adbd1xllm/core/framework/block/block_manager_pool.cppSingleBlockManagernum_blocks frommax_seqs_per_batch + 2→max_concurrent_requests + 2b2900e7
bugfix: fix high concurrency linear state overflow (#1422)c7ec380fxllm/core/distributed_runtime/llm_engine.cppnum_linear_state_blocksfrommax_seqs_per_batch + 2→max_concurrent_requests + 2, update error messagesRoot Cause
Both bugs use
max_seqs_per_batchas the allocation basis for shared single-block and linear-state resources. Whenmax_concurrent_requests > max_seqs_per_batch, reserved block count is insufficient, causing resource exhaustion and overflow errors under high concurrency.Note
c7ec380falso changedvlm_engine.cpp, but VLMEngine onmainhas been refactored and no longer contains linear-state reservation logic, so that part was dropped.git cherry-pick(Author field unchanged).