feat(memory-saver): wrap CUDA graphs and attention workspaces in saver.region()#274
Draft
qywu wants to merge 1 commit into
Draft
Conversation
…r.region() The base PR (#272) reclaimed ~90.7% of the engine footprint by releasing the regions weight_loader / KV cache modules wrap. The remaining ~9.3% includes: - CUDA graph captures (CUDAGraph objects + the static input/output tensors they capture) - Attention-backend workspace buffers — TRTLLM_MHA (512 MiB) and flashinfer prefill workspace under flashmla This change brings those allocations inside torch_memory_saver.region() so they're released by /release_memory_occupation alongside the rest. Changes: - BaseAttnConfig / MHAConfig / MLAConfig: propagate enable_memory_saver from server_args into the attention config so backends can opt their workspaces in without globals. - trtllm.py, flashmla.py: allocate the shared workspace buffer inside saver.region() when enable_memory_saver is True. - cuda_graph_wrapper.py: accept memory_saver_adapter; wrap self.capture() in adapter.region() so the CUDAGraph captures and their persistent static buffers live inside the released region. Static-tensor contents are zeroed across pause/resume but callers overwrite them on every replay, so this is benign. - model_executor.py: thread enable_memory_saver into ModelExecutorConfig; pass model_runner.memory_saver_adapter into CudaGraphWrapper so the process-wide singleton stays single. Expected impact: +5-9% reclaim on top of #272, bringing the total to roughly 97-99% of the engine footprint on H100. Exact gain depends on how many graphs are captured and which attention backend is selected. Stacked on top of #272. Signed-off-by: Qingyang Wu <willqywu@gmail.com>
This was referenced May 27, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Stacks on #272.
Summary
The base PR (#272) measured 90.7 % reclaim on a synthetic test because only weights, KV cache, and
req_to_token_poollive insidesaver.region(). The remaining ~9 % sits in:cuda_graph_wrapper.py:295allocates aCUDAGraphper batch size and captures static input/output buffers. On a moderate model this can be 1-3 GiB.trtllm.py:119reserves a 512 MiB shared workspace;flashmla.py:141reserves the flashinfer prefill workspace.This PR moves both classes of allocation inside
torch_memory_saver.region()so/release_memory_occupationreclaims them too.Changes
BaseAttnConfig/MHAConfig/MLAConfig— propagateenable_memory_saverfromserver_argsinto the attention config so backends can opt their workspaces in without resorting to globals or env lookups.trtllm.py,flashmla.py— allocate the shared workspace buffer insidesaver.region()whenenable_memory_saveris True.cuda_graph_wrapper.py— acceptmemory_saver_adapter; wrapself.capture()inadapter.region()so eachCUDAGraphcapture and the static buffers captured for it live inside the released region. Static-tensor contents are zeroed across pause/resume, but callers overwrite them on every replay, so this is benign — no graph re-capture is required.model_executor.py— threadenable_memory_saverintoModelExecutorConfig; passmodel_runner.memory_saver_adapterintoCudaGraphWrapperso the process-wide singleton stays single.Correctness — CUDA graphs under pause/resume
torch_memory_saver.pause()releases the physical pages backing tensors insaver.region()while preserving their virtual addresses. The capturedCUDAGraphrecords pointers, not page contents, so:resume().saver.region()already).Verification on H100
Full-stack measurement (this PR + #272 + #273) on Qwen2-1.5B-Instruct,
gpu_memory_utilization=0.5,--attention-backend=triton,--enforce-eager:/release_memory_occupation/resume_memory_occupationUp from 90.7 % synthetic-test reclaim for #272 alone — the +7 pp comes from this PR's workspace wrapping (and #273's allocator flush).
Important caveat:
--enforce-eagerwas set during this test to dodge an unrelated FA3 kernel mismatch in the local venv, so the CUDA-graph wrapping path (the largest of the three wins in this PR conceptually) was not exercised end-to-end. The trtllm/flashmla workspace wrapping and theBaseAttnConfigplumbing are exercised. A follow-up run with CUDA graphs enabled is needed to confirm the graph-capture footprint also drops to ~0 afterrelease_memory_occupation.Test plan
saver.region()reclaim with the rest of the engine footprint (verified by the 97.7 % full-stack number — workspaces dropped along with weights and KV).--enforce-eager(no CUDA graphs) still works unchanged — verified.enable_memory_saver=Falsecontinues to be a no-op (adapter is a Noop in that case) — config plumbing keeps the flag opt-in.pause/resumenot yet verified end-to-end — blocked locally by the unrelated FA3 issue. Need a clean H100 run with default attention + CUDA graphs enabled.Follow-ups out of scope
layers/moe/backends/wna16/marlin.py:145)models/qwen3_vision.py:323)models/deepseek_v4.py:2801/2817)multimodal/encoder_cudagraph.py:267)These follow the same pattern; left out here to keep this PR reviewable.