
feat(memory-saver): wrap CUDA graphs and attention workspaces in saver.region() #274

Draft
qywu wants to merge 1 commit into feat/http-memory-occupation-endpoints from feat/memory-saver-cuda-graphs-workspaces


@qywu qywu commented May 27, 2026

Stacks on #272.

Summary

The base PR (#272) measured 90.7 % reclaim on a synthetic test because only weights, KV cache, and req_to_token_pool live inside saver.region(). The remaining ~9 % sits in:

  • CUDA graph captures: cuda_graph_wrapper.py:295 allocates a CUDAGraph per batch size and captures static input/output buffers. On a moderately sized model this can be 1-3 GiB.
  • Attention-backend workspaces: trtllm.py:119 reserves a 512 MiB shared workspace; flashmla.py:141 reserves the flashinfer prefill workspace.

This PR moves both classes of allocation inside torch_memory_saver.region() so /release_memory_occupation reclaims them too.

Changes

  • BaseAttnConfig / MHAConfig / MLAConfig — propagate enable_memory_saver from server_args into the attention config so backends can opt their workspaces in without resorting to globals or env lookups.
  • trtllm.py, flashmla.py — allocate the shared workspace buffer inside saver.region() when enable_memory_saver is True.
  • cuda_graph_wrapper.py — accept memory_saver_adapter; wrap self.capture() in adapter.region() so each CUDAGraph capture and the static buffers captured for it live inside the released region. Static-tensor contents are zeroed across pause/resume, but callers overwrite them on every replay, so this is benign — no graph re-capture is required.
  • model_executor.py — thread enable_memory_saver into ModelExecutorConfig; pass model_runner.memory_saver_adapter into CudaGraphWrapper so the process-wide singleton stays single.
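
The plumbing described above can be illustrated with a minimal sketch. It is not the code in this PR: `ToyAdapter`, `ToyAttnConfig`, and `make_workspace` are hypothetical names, and the adapter only records which allocations happened inside `region()` instead of calling torch_memory_saver. The point it shows is that the config flag, not a global or an env lookup, decides whether the workspace is allocated inside the region.

```python
from contextlib import contextmanager, nullcontext
from dataclasses import dataclass


class ToyAdapter:
    """Hypothetical stand-in for a memory-saver adapter (not the real API)."""

    def __init__(self):
        self.tracked = []  # names of buffers allocated inside region()
        self._depth = 0

    @contextmanager
    def region(self):
        # A real adapter would enter torch_memory_saver.region() here.
        self._depth += 1
        try:
            yield
        finally:
            self._depth -= 1

    def allocate(self, name, nbytes):
        # Record the buffer only if it was allocated inside region().
        if self._depth > 0:
            self.tracked.append(name)
        return bytearray(nbytes)


@dataclass
class ToyAttnConfig:
    # Assumed to be copied from server_args when the config is built.
    enable_memory_saver: bool = False


def make_workspace(config, adapter, nbytes):
    # Opt in through the config; otherwise allocate outside any region.
    ctx = adapter.region() if config.enable_memory_saver else nullcontext()
    with ctx:
        return adapter.allocate("attn_workspace", nbytes)


saver_on = ToyAdapter()
make_workspace(ToyAttnConfig(enable_memory_saver=True), saver_on, 1024)
saver_off = ToyAdapter()
make_workspace(ToyAttnConfig(enable_memory_saver=False), saver_off, 1024)
print(saver_on.tracked, saver_off.tracked)
```

With the flag off, the allocation still succeeds but nothing is tracked, which mirrors the opt-in behaviour the change list describes.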

Correctness — CUDA graphs under pause/resume

torch_memory_saver.pause() releases the physical pages backing tensors in saver.region() while preserving their virtual addresses. The captured CUDAGraph records pointers, not page contents, so:

  • ✅ The graph's recorded pointers remain valid after resume().
  • ✅ Inputs/outputs are written by the caller on every replay, so zeroed pages on resume are benign.
  • ✅ Captured kernel arguments that reference weights or KV cache are also paused/resumed in lockstep (because those are inside saver.region() already).
  • ⚠️ Any kernel that reads from a static workspace without writing it first would observe zeros after resume. That's not a pattern any of the wrapped backends use today, but anyone adding a new backend that holds state in the workspace across replays should be aware.
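
The argument above can be checked against a toy model. This is a pure-Python simulation under stated assumptions, not CUDA or torch_memory_saver: `ToyVirtualMemory` and `ToyGraph` are hypothetical, addresses are plain integers, and "pause" simply drops the backing list while keeping the address. It shows that a replay which writes its inputs first gives the same result before and after pause/resume, while anything read without being rewritten comes back as zeros.

```python
class ToyVirtualMemory:
    """Toy model: virtual addresses stay fixed, backing pages can be dropped."""

    def __init__(self):
        self._pages = {}
        self._sizes = {}
        self._next_addr = 0x1000

    def alloc(self, n):
        addr = self._next_addr
        self._next_addr += 0x1000
        self._pages[addr] = [0] * n
        self._sizes[addr] = n
        return addr

    def pause(self):
        # Drop the physical backing, keep every address reserved.
        for addr in self._pages:
            self._pages[addr] = None

    def resume(self):
        # Re-back every address with fresh, zeroed pages.
        for addr, n in self._sizes.items():
            self._pages[addr] = [0] * n

    def write(self, addr, values):
        self._pages[addr][:] = values

    def read(self, addr):
        return list(self._pages[addr])


class ToyGraph:
    """Records addresses at capture time, as a captured graph records pointers."""

    def __init__(self, mem, in_addr, out_addr):
        self.mem, self.in_addr, self.out_addr = mem, in_addr, out_addr

    def replay(self):
        x = self.mem.read(self.in_addr)
        self.mem.write(self.out_addr, [2 * v for v in x])


mem = ToyVirtualMemory()
in_addr, out_addr = mem.alloc(3), mem.alloc(3)
graph = ToyGraph(mem, in_addr, out_addr)  # "capture"

mem.write(in_addr, [1, 2, 3])
graph.replay()
before = mem.read(out_addr)

mem.pause()
mem.resume()
stale = mem.read(in_addr)  # contents were not preserved

mem.write(in_addr, [1, 2, 3])  # caller rewrites inputs, as on every replay
graph.replay()
after = mem.read(out_addr)
print(before, stale, after)
```

The `stale` read corresponds to the warning case: a backend that relied on workspace contents surviving across replays would see zeros here.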

Verification on H100

Full-stack measurement (this PR + #272 + #273) on Qwen2-1.5B-Instruct, gpu_memory_utilization=0.5, --attention-backend=triton, --enforce-eager:

| Phase | GPU used | Δ |
| --- | --- | --- |
| After model load | 41,431 MiB | engine footprint: +40,804 MiB |
| After /release_memory_occupation | 1,585 MiB | −39,846 MiB = 97.7 % of footprint |
| After /resume_memory_occupation | 41,533 MiB | +102 MiB vs. load (allocator overhead) |

Up from 90.7 % synthetic-test reclaim for #272 alone — the +7 pp comes from this PR's workspace wrapping (and #273's allocator flush).
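
For reference, the percentages in this section follow from the reported MiB figures as below. The helper name `reclaim_fraction` is only illustrative; the numbers are the ones from the measurement above.

```python
def reclaim_fraction(used_after_load_mib, used_after_release_mib, footprint_mib):
    """Fraction of the engine footprint that was released."""
    released = used_after_load_mib - used_after_release_mib
    return released / footprint_mib


released_mib = 41431 - 1585            # MiB freed by /release_memory_occupation
pct = 100 * reclaim_fraction(41431, 1585, 40804)
resume_overhead_mib = 41533 - 41431    # MiB above the post-load level after resume

print(f"released {released_mib} MiB = {pct:.1f} % of footprint")
print(f"resume overhead {resume_overhead_mib} MiB")
```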

Important caveat: --enforce-eager was set during this test to dodge an unrelated FA3 kernel mismatch in the local venv, so the CUDA-graph wrapping path (conceptually the largest of the three wins in this PR) was not exercised end-to-end. The trtllm/flashmla workspace wrapping and the BaseAttnConfig plumbing are exercised. A follow-up run with CUDA graphs enabled is needed to confirm the graph-capture footprint also drops to ~0 after /release_memory_occupation.

Test plan

  • Workspace allocations under saver.region() reclaim with the rest of the engine footprint (verified by the 97.7 % full-stack number — workspaces dropped along with weights and KV).
  • Verify --enforce-eager (no CUDA graphs) still works unchanged — verified.
  • Verify enable_memory_saver=False continues to be a no-op (adapter is a Noop in that case) — config plumbing keeps the flag opt-in.
  • CUDA-graph wrapping under pause/resume not yet verified end-to-end — blocked locally by the unrelated FA3 issue. Need a clean H100 run with default attention + CUDA graphs enabled.
  • Generation parity (output identical pre-release / post-resume) when combined with the CPU staging from #275 (feat(memory-saver): optional CPU staging for round-trip weight preservation): verified for the triton + eager path, still TODO for the CUDA-graph path.

Follow-ups out of scope

  • MoE Marlin workspace (layers/moe/backends/wna16/marlin.py:145)
  • Vision-encoder workspaces (models/qwen3_vision.py:323)
  • DeepSeek-V4 prefill workspaces (models/deepseek_v4.py:2801/2817)
  • Multimodal encoder cuda-graph (multimodal/encoder_cudagraph.py:267)

These follow the same pattern; left out here to keep this PR reviewable.

…r.region()

The base PR (#272) reclaimed ~90.7% of the engine footprint by releasing
the regions that the weight_loader and KV cache modules wrap. The
remaining ~9.3% includes:

  - CUDA graph captures (CUDAGraph objects + the static input/output
    tensors they capture)
  - Attention-backend workspace buffers — TRTLLM_MHA (512 MiB) and
    flashinfer prefill workspace under flashmla

This change brings those allocations inside torch_memory_saver.region()
so they're released by /release_memory_occupation alongside the rest.

Changes:
  - BaseAttnConfig / MHAConfig / MLAConfig: propagate enable_memory_saver
    from server_args into the attention config so backends can opt their
    workspaces in without globals.
  - trtllm.py, flashmla.py: allocate the shared workspace buffer inside
    saver.region() when enable_memory_saver is True.
  - cuda_graph_wrapper.py: accept memory_saver_adapter; wrap self.capture()
    in adapter.region() so the CUDAGraph captures and their persistent
    static buffers live inside the released region. Static-tensor contents
    are zeroed across pause/resume but callers overwrite them on every
    replay, so this is benign.
  - model_executor.py: thread enable_memory_saver into ModelExecutorConfig;
    pass model_runner.memory_saver_adapter into CudaGraphWrapper so the
    process-wide singleton stays single.

Expected impact: +5-9% reclaim on top of #272, bringing the total to
roughly 97-99% of the engine footprint on H100. Exact gain depends on
how many graphs are captured and which attention backend is selected.

Stacked on top of #272.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>