feat(memory-saver): flush PyTorch alloc cache and IPC handles after pause #273

Merged
qywu merged 1 commit into feat/http-memory-occupation-endpoints from feat/memory-saver-empty-cache on May 27, 2026

qywu (Collaborator) commented May 27, 2026

Stacks on #272.

Summary

torch_memory_saver.pause() releases the physical pages backing tensors in saver.region(), but PyTorch's caching allocator still holds onto its own free pool and any NCCL/IPC handles outside that region. From the driver's perspective those pages are still ours, so co-tenants can't use the headroom pause() just freed.

This adds a torch.cuda.empty_cache() + torch.cuda.ipc_collect() immediately after pause() in the scheduler-side handler, so the bytes are actually surrendered to the driver pool.
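A minimal sketch of the ordering described above, not the actual handler from this PR. The function name `release_memory_occupation` and the injectable `cuda` parameter are hypothetical and exist only so the call order can be exercised without a GPU; `saver` is assumed to be any object exposing `pause()`, such as `torch_memory_saver`.

```python
def release_memory_occupation(saver, cuda=None):
    """Pause the saver, then flush the allocator cache and IPC handles.

    `saver`: any object with a pause() method (assumed: torch_memory_saver).
    `cuda`:  defaults to torch.cuda; injectable for testing (hypothetical).
    """
    if cuda is None:
        import torch  # imported lazily; only needed on the real GPU path
        cuda = torch.cuda

    # 1. Release the physical pages backing tensors in saver.region().
    saver.pause()
    # 2. Return the caching allocator's unused cached blocks to the driver.
    cuda.empty_cache()
    # 3. Collect GPU memory that was released through CUDA IPC.
    cuda.ipc_collect()
```

The order matters in this sketch: calling `empty_cache()` after `pause()` means blocks freed during the pause path are also eligible to be returned.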

Verification on H100

Combined with #272 alone, the synthetic measurement was 90.7% reclaim of engine footprint. With this PR + #274 stacked on #272, the full-stack reclaim on a real engine (Qwen2-1.5B, 40,804 MiB footprint, --enforce-eager) is 97.7%; see the #272 description for the per-phase table.
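As a back-of-envelope reading of those figures, the snippet below converts the full-stack percentage into MiB. It assumes the 97.7% is taken relative to the quoted 40,804 MiB footprint; the per-phase table in #272 is the authoritative breakdown.

```python
FOOTPRINT_MIB = 40_804       # engine footprint quoted above
FULL_STACK_RECLAIM = 0.977   # full-stack reclaim fraction quoted above


def reclaimed_and_residual(footprint_mib, reclaim_fraction):
    """Split a footprint into (reclaimed, still-resident) MiB."""
    reclaimed = footprint_mib * reclaim_fraction
    return reclaimed, footprint_mib - reclaimed


reclaimed, residual = reclaimed_and_residual(FOOTPRINT_MIB, FULL_STACK_RECLAIM)
print(f"reclaimed ~{reclaimed:.0f} MiB, still resident ~{residual:.0f} MiB")
```

Under that assumption, a little under 1 GiB stays resident after release.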

The isolated contribution of this PR can't be precisely separated from #274 in the stacked test (both ran together). What can be said:

  • Without empty_cache() after pause(), the PyTorch caching allocator continues to hold the freed blocks. On a steady-state engine that's typically a few hundred MiB; right after a burst that ended in free → realloc, it can be > 1 GiB.
  • ipc_collect() adds a small but non-zero amount (NCCL collectives, P2P buffers).
  • Release latency stays at 23 ms — empty_cache() is essentially free at pause time.
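One way to estimate how much the caching allocator is holding before the flush is to compare reserved and allocated bytes. The sketch below uses `torch.cuda.memory_reserved()` and `torch.cuda.memory_allocated()`; the helper names are hypothetical. The difference is an upper bound on what `empty_cache()` can return, since partially used segments stay reserved.

```python
def cached_free_mib(reserved_bytes, allocated_bytes):
    """MiB reserved by the caching allocator but not backing live tensors."""
    return (reserved_bytes - allocated_bytes) / (1024 ** 2)


def allocator_cache_snapshot(device=None):
    """Read the current cached-but-free estimate from PyTorch (needs a GPU)."""
    import torch  # imported lazily so the pure helper works without torch

    return cached_free_mib(
        torch.cuda.memory_reserved(device),
        torch.cuda.memory_allocated(device),
    )
```

Taking a snapshot right before `pause()` and again after `empty_cache()` would give a rough per-process view of the allocator's share, independent of `nvidia-smi`.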

Test plan

  • Capture nvidia-smi memory.used delta with the empty_cache call across /release_memory_occupation — confirmed in the H100 E2E run.
  • Verify no perf regression on /resume_memory_occupation (allocator will rebuild the cache lazily on next forward) — resume latency stayed at 18 ms.
  • Single-process A/B test isolating just empty_cache's contribution (low priority; #274, "feat(memory-saver): wrap CUDA graphs and attention workspaces in saver.region()", is the bigger win in the stack).
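The first item in the plan could be scripted roughly as follows. This is a sketch, not the script used for the H100 run: the helper names are hypothetical, and how `/release_memory_occupation` is triggered (host, port, HTTP method) is left to the caller because the PR does not specify it.

```python
import subprocess

# Query used memory per GPU, in MiB, with no header or unit suffix.
NVIDIA_SMI_QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used",
    "--format=csv,noheader,nounits",
]


def parse_memory_used(csv_text):
    """Parse nvidia-smi output into a list of MiB values, one per GPU."""
    return [int(line.strip()) for line in csv_text.splitlines() if line.strip()]


def memory_used_mib():
    """Run nvidia-smi and return memory.used for each visible GPU."""
    result = subprocess.run(
        NVIDIA_SMI_QUERY, check=True, capture_output=True, text=True
    )
    return parse_memory_used(result.stdout)


def reclaimed_mib(before, after):
    """Per-GPU drop in memory.used between two snapshots."""
    return [b - a for b, a in zip(before, after)]
```

Usage would be: take `before = memory_used_mib()`, trigger `/release_memory_occupation` on the running engine, take `after = memory_used_mib()`, then compare `reclaimed_mib(before, after)` with and without the `empty_cache()` call.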

…ause

torch_memory_saver.pause() releases the physical pages backing tensors in
saver.region(), but PyTorch's caching allocator still holds onto its own
free pool and any NCCL/IPC handles outside that region. From the driver's
perspective those pages remain ours, so co-tenants can't use the headroom
that pause() just freed.

Call torch.cuda.empty_cache() + torch.cuda.ipc_collect() immediately after
pause() to surrender those bytes as well. On a typical workload this
recovers an additional few hundred MiB on top of saver.pause() alone (the
exact number depends on allocator fragmentation at the time of release).

Stacked on top of #272.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
@qywu qywu marked this pull request as ready for review May 27, 2026 04:46
@qywu qywu requested a review from a team as a code owner May 27, 2026 04:46
@qywu qywu merged commit 33682b8 into feat/http-memory-occupation-endpoints May 27, 2026
2 checks passed
@qywu qywu deleted the feat/memory-saver-empty-cache branch May 27, 2026 04:47