feat(memory-saver): wrap CUDA graphs and attention workspaces in saver.region() by qywu · Pull Request #274 · lightseekorg/tokenspeed

qywu · 2026-05-27T00:42:07Z

Stacks on #272.

Summary

The base PR (#272) measured 90.7 % reclaim on a synthetic test because only weights, KV cache, and req_to_token_pool live inside saver.region(). The remaining ~9 % sits in:

CUDA graph captures — cuda_graph_wrapper.py:295 allocates a CUDAGraph per batch size and captures static input/output buffers. On a moderate model this can be 1-3 GiB.
Attention-backend workspaces — trtllm.py:119 reserves a 512 MiB shared workspace; flashmla.py:141 reserves the flashinfer prefill workspace.

This PR moves both classes of allocation inside torch_memory_saver.region() so /release_memory_occupation reclaims them too.

Changes

BaseAttnConfig / MHAConfig / MLAConfig — propagate enable_memory_saver from server_args into the attention config so backends can opt their workspaces in without resorting to globals or env lookups.
trtllm.py, flashmla.py — allocate the shared workspace buffer inside saver.region() when enable_memory_saver is True.
cuda_graph_wrapper.py — accept memory_saver_adapter; wrap self.capture() in adapter.region() so each CUDAGraph capture and the static buffers captured for it live inside the released region. Static-tensor contents are zeroed across pause/resume, but callers overwrite them on every replay, so this is benign — no graph re-capture is required.
model_executor.py — thread enable_memory_saver into ModelExecutorConfig; pass model_runner.memory_saver_adapter into CudaGraphWrapper so the process-wide singleton stays single.
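
The plumbing above can be sketched without a GPU. This is a toy model, not the code in this PR: `ToyAdapter`, `ToyAttnConfig`, `init_backend`, `capture_graphs`, and `build_engine` are hypothetical names, and `bytearray` stands in for device allocations. It only illustrates that one shared adapter decides which allocations fall inside a region, and that a disabled adapter turns `region()` into a no-op.

```python
from contextlib import contextmanager
from dataclasses import dataclass


class ToyAdapter:
    """Hypothetical stand-in for the memory-saver adapter (not the real class)."""

    def __init__(self, enabled: bool):
        self.enabled = enabled
        self.tracked = []  # names of allocations that a pause would release
        self._depth = 0

    @contextmanager
    def region(self):
        # When disabled, region() does nothing, mirroring the Noop adapter.
        if self.enabled:
            self._depth += 1
        try:
            yield
        finally:
            if self.enabled:
                self._depth -= 1

    def alloc(self, name: str, nbytes: int) -> bytearray:
        # Only allocations made while a region is open are tracked.
        if self._depth > 0:
            self.tracked.append(name)
        return bytearray(nbytes)


@dataclass
class ToyAttnConfig:
    enable_memory_saver: bool = False  # assumed to be copied from server args


def init_backend(config: ToyAttnConfig, adapter: ToyAdapter) -> bytearray:
    # Backends opt in through the config flag rather than a global or env lookup.
    if config.enable_memory_saver:
        with adapter.region():
            return adapter.alloc("attn_workspace", 512)
    return adapter.alloc("attn_workspace", 512)


def capture_graphs(adapter: ToyAdapter, batch_sizes=(1, 2, 4)) -> dict:
    static_buffers = {}
    for bs in batch_sizes:
        # Each capture, and the static buffers it pins, happens inside the region.
        with adapter.region():
            static_buffers[bs] = adapter.alloc(f"graph_static_bs{bs}", 64 * bs)
    return static_buffers


def build_engine(enable_memory_saver: bool) -> ToyAdapter:
    adapter = ToyAdapter(enabled=enable_memory_saver)  # one shared instance
    init_backend(ToyAttnConfig(enable_memory_saver), adapter)
    capture_graphs(adapter)
    adapter.alloc("untracked_scratch", 16)  # allocated outside any region
    return adapter


print(build_engine(True).tracked)
print(build_engine(False).tracked)
```

With the flag on, the workspace and the per-batch-size static buffers are tracked and the scratch allocation is not; with the flag off, nothing is tracked.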

Correctness — CUDA graphs under pause/resume

torch_memory_saver.pause() releases the physical pages backing tensors in saver.region() while preserving their virtual addresses. The captured CUDAGraph records pointers, not page contents, so:

✅ The graph's recorded pointers remain valid after resume().
✅ Inputs/outputs are written by the caller on every replay, so zeroed pages on resume are benign.
✅ Captured kernel arguments that reference weights or KV cache are also paused/resumed in lockstep (because those are inside saver.region() already).
⚠️ Any kernel that reads from a static workspace without writing it first would observe zeros after resume. That's not a pattern any of the wrapped backends use today, but anyone adding a new backend that holds state in the workspace across replays should be aware.
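
The argument above can be illustrated with a small simulation. This is a sketch under the semantics stated in this description (addresses preserved, contents zeroed on resume), not a model of the real allocator: `ToyBuffer`, `replay_stateless`, and `replay_stateful` are hypothetical, and Python lists stand in for physical pages.

```python
class ToyBuffer:
    """Toy model of a tensor allocated inside a saver region."""

    def __init__(self, address: int, size: int):
        self.address = address       # stands in for the virtual address
        self.size = size
        self.pages = [0] * size      # stands in for the physical pages

    def pause(self):
        self.pages = None            # physical pages released, address kept

    def resume(self):
        self.pages = [0] * self.size  # fresh pages, assumed zeroed as described


def replay_stateless(inp: ToyBuffer, out: ToyBuffer, values):
    # The caller overwrites the static input before the "kernel" reads it.
    inp.pages[:] = values
    out.pages[:] = [2 * v for v in inp.pages]
    return list(out.pages)


def replay_stateful(workspace: ToyBuffer) -> int:
    # Reads the workspace before writing it: depends on the previous replay.
    workspace.pages[0] += 1
    return workspace.pages[0]


inp, out, ws = ToyBuffer(0x1000, 3), ToyBuffer(0x2000, 3), ToyBuffer(0x3000, 1)

before = replay_stateless(inp, out, [1, 2, 3])
replay_stateful(ws)
counter_before = replay_stateful(ws)
addresses_before = [b.address for b in (inp, out, ws)]

for b in (inp, out, ws):
    b.pause()
for b in (inp, out, ws):
    b.resume()

after = replay_stateless(inp, out, [1, 2, 3])
counter_after = replay_stateful(ws)
addresses_after = [b.address for b in (inp, out, ws)]

print(before == after, addresses_before == addresses_after)
print(counter_before, counter_after)
```

The stateless replay gives the same result before and after the pause/resume cycle, while the stateful counter restarts from zero. That second case is the pattern the warning above describes.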

Verification on H100

Full-stack measurement (this PR + #272 + #273) on Qwen2-1.5B-Instruct, gpu_memory_utilization=0.5, --attention-backend=triton, --enforce-eager:

| Phase | GPU used | Δ |
| --- | --- | --- |
| After model load | 41,431 MiB | engine footprint: +40,804 MiB |
| After `/release_memory_occupation` | 1,585 MiB | −39,846 MiB = 97.7 % of footprint |
| After `/resume_memory_occupation` | 41,533 MiB | +102 MiB vs. load (allocator overhead) |

Up from 90.7 % synthetic-test reclaim for #272 alone — the +7 pp comes from this PR's workspace wrapping (and #273's allocator flush).
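
The percentages follow directly from the measurements above. A short check, using only the numbers reported in this section (the variable names are illustrative):

```python
# Values in MiB, copied from the measurements above.
loaded = 41_431      # GPU used after model load
released = 1_585     # GPU used after /release_memory_occupation
resumed = 41_533     # GPU used after /resume_memory_occupation
footprint = 40_804   # engine footprint reported at load

baseline = loaded - footprint            # memory in use before the engine loaded
reclaimed = loaded - released            # memory returned by the release call
reclaim_pct = 100 * reclaimed / footprint
gain_pp = reclaim_pct - 90.7             # versus the synthetic-test figure for #272
resume_delta = resumed - loaded

print(f"baseline:  {baseline} MiB")
print(f"reclaimed: {reclaimed} MiB ({reclaim_pct:.1f} % of footprint)")
print(f"gain over #272 alone: {gain_pp:.1f} pp")
print(f"resume delta: +{resume_delta} MiB")
```

The 90.7 % figure comes from a synthetic test and the 97.7 % figure from a full-stack run, so the difference in percentage points is indicative rather than a like-for-like comparison.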

Important caveat: --enforce-eager was set during this test to dodge an unrelated FA3 kernel mismatch in the local venv, so the CUDA-graph wrapping path (the largest of the three wins in this PR conceptually) was not exercised end-to-end. The trtllm/flashmla workspace wrapping and the BaseAttnConfig plumbing are exercised. A follow-up run with CUDA graphs enabled is needed to confirm the graph-capture footprint also drops to ~0 after release_memory_occupation.

Test plan

Workspace allocations under saver.region() reclaim with the rest of the engine footprint (verified by the 97.7 % full-stack number — workspaces dropped along with weights and KV).
Verify --enforce-eager (no CUDA graphs) still works unchanged — verified.
Verify enable_memory_saver=False continues to be a no-op (adapter is a Noop in that case) — config plumbing keeps the flag opt-in.
CUDA-graph wrapping under pause/resume not yet verified end-to-end — blocked locally by the unrelated FA3 issue. Need a clean H100 run with default attention + CUDA graphs enabled.
Generation parity (output identical pre-release / post-resume) when combined with the CPU staging from #275 (feat(memory-saver): optional CPU staging for round-trip weight preservation): verified for the triton + eager path, still TODO for the CUDA-graph path.

Follow-ups out of scope

MoE Marlin workspace (layers/moe/backends/wna16/marlin.py:145)
Vision-encoder workspaces (models/qwen3_vision.py:323)
DeepSeek-V4 prefill workspaces (models/deepseek_v4.py:2801/2817)
Multimodal encoder cuda-graph (multimodal/encoder_cudagraph.py:267)

These follow the same pattern; left out here to keep this PR reviewable.

…r.region()

The base PR (#272) reclaimed ~90.7% of the engine footprint by releasing the regions weight_loader / KV cache modules wrap. The remaining ~9.3% includes:

- CUDA graph captures (CUDAGraph objects + the static input/output tensors they capture)
- Attention-backend workspace buffers: TRTLLM_MHA (512 MiB) and flashinfer prefill workspace under flashmla

This change brings those allocations inside torch_memory_saver.region() so they're released by /release_memory_occupation alongside the rest.

Changes:

- BaseAttnConfig / MHAConfig / MLAConfig: propagate enable_memory_saver from server_args into the attention config so backends can opt their workspaces in without globals.
- trtllm.py, flashmla.py: allocate the shared workspace buffer inside saver.region() when enable_memory_saver is True.
- cuda_graph_wrapper.py: accept memory_saver_adapter; wrap self.capture() in adapter.region() so the CUDAGraph captures and their persistent static buffers live inside the released region. Static-tensor contents are zeroed across pause/resume but callers overwrite them on every replay, so this is benign.
- model_executor.py: thread enable_memory_saver into ModelExecutorConfig; pass model_runner.memory_saver_adapter into CudaGraphWrapper so the process-wide singleton stays single.

Expected impact: +5-9% reclaim on top of #272, bringing the total to roughly 97-99% of the engine footprint on H100. Exact gain depends on how many graphs are captured and which attention backend is selected.

Stacked on top of #272.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>

qywu wants to merge 1 commit into feat/http-memory-occupation-endpoints from feat/memory-saver-cuda-graphs-workspaces
