Skip to content

bugfix: use max_concurrent_requests for single block and linear state allocation.#1496

Open
pjgao wants to merge 2 commits into
jd-opensource:mainfrom
pjgao:cherry-pick/concurrent-linear-bugfix
Open

bugfix: use max_concurrent_requests for single block and linear state allocation.#1496
pjgao wants to merge 2 commits into
jd-opensource:mainfrom
pjgao:cherry-pick/concurrent-linear-bugfix

Conversation

@pjgao
Copy link
Copy Markdown

@pjgao pjgao commented May 20, 2026

Summary

Cherry-pick two bugfixes from preview/qwen3.5-qwen3.6 branch to main.

Cherry-picked Commits

  1. 9b64c30 bugfix: change single block manager blocks to max concurrent requests. (#1413)

    • Original Author: Yingxu Deng dengyingxu1@jd.com
    • Original Commit: 005adbd1
    • File: xllm/core/framework/block/block_manager_pool.cpp
    • Change: SingleBlockManager num_blocks from max_seqs_per_batch + 2max_concurrent_requests + 2
  2. b2900e7 bugfix: fix high concurrency linear state overflow (#1422)

    • Original Author: Joey Gao 1783198484@qq.com
    • Original Commit: c7ec380f
    • File: xllm/core/distributed_runtime/llm_engine.cpp
    • Change: num_linear_state_blocks from max_seqs_per_batch + 2max_concurrent_requests + 2, update error messages

Root Cause

Both bugs use max_seqs_per_batch as the allocation basis for shared single-block and linear-state resources. When max_concurrent_requests > max_seqs_per_batch, reserved block count is insufficient, causing resource exhaustion and overflow errors under high concurrency.

Note

  • Original c7ec380f also changed vlm_engine.cpp, but VLMEngine on main has been refactored and no longer contains linear-state reservation logic, so that part was dropped.
  • Original authorship preserved via git cherry-pick (Author field unchanged).

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request replaces the use of max_seqs_per_batch from the SchedulerConfig with a global flag FLAGS_max_concurrent_requests for KV cache capacity estimation and block manager allocation. Review feedback highlights that direct access to global flags violates Rule 10 of the repository style guide, which prefers configuration structs or constructor parameters over global state. It is recommended to add max_concurrent_requests to the SchedulerConfig or pass it via the Options struct to maintain encapsulation and follow established architectural patterns.


kv_cache_cap.num_linear_state_blocks() =
::xllm::SchedulerConfig::get_instance().max_seqs_per_batch() + 2;
kv_cache_cap.num_linear_state_blocks() = FLAGS_max_concurrent_requests + 2;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Directly accessing FLAGS_max_concurrent_requests violates the repository style guide (Rule 10), which prefers using configuration structs over global flags. Since max_seqs_per_batch is already managed via SchedulerConfig, max_concurrent_requests should also be added to SchedulerConfig and accessed via ::xllm::SchedulerConfig::get_instance().max_concurrent_requests(). This ensures consistency, centralizes configuration, and allows for proper validation during initialization. Additionally, using std::max(FLAGS_max_concurrent_requests, ::xllm::SchedulerConfig::get_instance().max_seqs_per_batch()) would be more robust to ensure sufficient blocks are reserved even if max_concurrent_requests is misconfigured to a value smaller than the batch size.

References
  1. Do not overuse FLAGS_ global variables. Prefer passing configuration through constructor parameters or config structs. Only use global flags for top-level, process-wide settings. (link)

/*num_blocks=*/::xllm::SchedulerConfig::get_instance()
.max_seqs_per_batch() +
2,
/*num_blocks=*/FLAGS_max_concurrent_requests + 2,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Directly accessing FLAGS_max_concurrent_requests here violates the style guide (Rule 10) and breaks encapsulation. This value should be passed through the BlockManagerPool::Options struct, following the pattern used for other configuration parameters like num_blocks and block_size. This makes the class easier to test and decouples it from global state.

References
  1. Do not overuse FLAGS_ global variables. Prefer passing configuration through constructor parameters or config structs. (link)

"SCHEDULER OPTIONS",
{"max_tokens_per_batch",
"max_seqs_per_batch",
"max_concurrent_requests",
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

主分支重构后,这个参数丢失了嘛?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

是的,看着像是#1430 这个PR重构的

@yingxudeng yingxudeng changed the title bugfix: use max_concurrent_requests for single block and linear state allocation bugfix: use max_concurrent_requests for single block and linear state allocation. May 20, 2026

kv_cache_cap.num_linear_state_blocks() =
::xllm::SchedulerConfig::get_instance().max_seqs_per_batch() + 2;
::xllm::SchedulerConfig::get_instance().max_concurrent_requests() + 2;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vlm_engine.cpp 需要类似修改吗

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right, the qwen3.5 VLM code hasn't been merged yet. Never mind — let's revisit this after the VLM PR lands.

@pjgao pjgao force-pushed the cherry-pick/concurrent-linear-bugfix branch from 404ba50 to 2989452 Compare May 21, 2026 08:54
@pjgao pjgao force-pushed the cherry-pick/concurrent-linear-bugfix branch from 2989452 to 446c057 Compare May 21, 2026 11:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants