
bugfix: reduce acl graph memory overhead. #1457

Open
RobbieLeung wants to merge 1 commit into jd-opensource:main from RobbieLeung:bugfix/qwen35-acl-graph-memory-overhead

Conversation

Collaborator

@RobbieLeung RobbieLeung commented May 15, 2026

Size the ACL graph persistent_mask_ by the decode/spec-verify graph capacity instead of max_tokens_per_batch.

@RobbieLeung RobbieLeung changed the title from "bugfix: reduce qwen3.5 acl graph memory overhead." to "bugfix: reduce acl graph memory overhead." on May 15, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request optimizes memory allocation for persistent buffers in the ACL graph executor by sizing the attention mask based on decode graph capacity rather than the prefill budget. It also refines the capacity calculation for speculative decoding. Feedback indicates that the updated capacity function is incorrectly used for sequence-indexed metadata, causing significant memory over-allocation for tensors like block tables, which undermines the PR's memory reduction goals.
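To make the distinction in the summary concrete, here is a minimal sketch of the two capacities involved. It is not the code from acl_graph_executor_impl.cpp: the struct, the function names, the example values, and the assumed buffer shapes (one mask row per token, one block-table row per sequence) are all hypothetical and only illustrate why a token-indexed capacity suits the mask but over-allocates sequence-indexed metadata.

```cpp
#include <cstdint>

// Hypothetical configuration. Field names and values are illustrative
// assumptions, not taken from the xllm code base.
struct GraphConfig {
  int64_t max_tokens_per_batch;    // prefill token budget
  int64_t max_seqs_per_batch;      // sequences in one decode batch
  int64_t num_speculative_tokens;  // draft tokens verified per sequence
  int64_t max_seq_len;             // assumed column count of the mask
};

// Token-indexed capacity of a decode or spec-verify graph: each sequence
// contributes one decode token plus its draft tokens.
constexpr int64_t decode_token_capacity(const GraphConfig& c) {
  return c.max_seqs_per_batch * (1 + c.num_speculative_tokens);
}

// Sequence-indexed capacity: one row per sequence, independent of how many
// tokens each sequence verifies.
constexpr int64_t decode_seq_capacity(const GraphConfig& c) {
  return c.max_seqs_per_batch;
}

// Element count of a 2-D persistent buffer with `rows` rows.
constexpr int64_t buffer_elements(int64_t rows, int64_t cols) {
  return rows * cols;
}

// Example values chosen only to show the ratio between the two sizings.
constexpr GraphConfig kExample{20480, 256, 3, 8192};

// Mask sized by the prefill budget versus by the decode graph capacity.
constexpr int64_t kMaskByPrefill =
    buffer_elements(kExample.max_tokens_per_batch, kExample.max_seq_len);
constexpr int64_t kMaskByDecode =
    buffer_elements(decode_token_capacity(kExample), kExample.max_seq_len);

static_assert(kMaskByDecode < kMaskByPrefill,
              "decode-sized mask should be smaller in this example");
```

Under these assumed values the token-indexed capacity is four times the sequence-indexed one, so a block table sized with decode_token_capacity would hold four times as many rows as it needs. That is the kind of over-allocation the review feedback describes.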

Comment thread xllm/core/runtime/acl_graph_executor_impl.cpp

3 participants