feat(memory-saver): flush PyTorch alloc cache and IPC handles after pause by qywu · Pull Request #273 · lightseekorg/tokenspeed

qywu · 2026-05-27T00:37:15Z

Stacks on #272.

Summary

torch_memory_saver.pause() releases the physical pages backing tensors in saver.region(), but PyTorch's caching allocator still holds onto its own free pool and any NCCL/IPC handles outside that region. From the driver's perspective those pages are still ours, so co-tenants can't use the headroom pause() just freed.

This adds torch.cuda.empty_cache() and torch.cuda.ipc_collect() calls immediately after pause() in the scheduler-side handler, so the bytes are actually surrendered to the driver pool.
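A minimal sketch of the call ordering described above, not the actual handler from this PR. The function name `release_memory_occupation` and its two parameters are hypothetical; the saver and CUDA module are passed in so the ordering can be exercised without a GPU.

```python
from typing import Any


def release_memory_occupation(saver: Any, cuda: Any) -> None:
    """Pause the saver, then flush the allocator cache and IPC handles.

    `saver` is assumed to expose pause() (as torch_memory_saver does) and
    `cuda` is assumed to be torch.cuda or an object with the same two methods.
    """
    # 1. Release the physical pages backing tensors in saver.region().
    saver.pause()
    # 2. Return PyTorch's cached-but-unused blocks to the driver.
    cuda.empty_cache()
    # 3. Release memory still pinned by CUDA IPC handles that are no longer used.
    cuda.ipc_collect()
```

In a real process this would presumably be invoked as `release_memory_occupation(torch_memory_saver, torch.cuda)`. The order matters: flushing the cache before pause() would miss any blocks that only become free as a result of the pause.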

Verification on H100

With #272 alone, the synthetic measurement showed 90.7 % reclaim of the engine footprint. With this PR and #274 stacked on #272, the full-stack reclaim on a real engine (Qwen2-1.5B, 40,804 MiB footprint, --enforce-eager) is 97.7 %; see the #272 description for the per-phase table.

The isolated contribution of this PR can't be separated precisely from #274 in the stacked test, because both ran together. What can be said:

Without empty_cache() after pause(), the PyTorch caching allocator continues to hold the freed blocks. On a steady-state engine that's typically a few hundred MiB; right after a burst that ended in free → realloc, it can be > 1 GiB.
ipc_collect() adds a small but non-zero amount (NCCL collectives, P2P buffers).
Release latency stays at 23 ms — empty_cache() is essentially free at pause time.
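One way to see how much the caching allocator is holding at pause time is to compare reserved and allocated bytes. The sketch below assumes `cuda` is torch.cuda (or an object with the same `memory_reserved()` and `memory_allocated()` methods); the helper name `cached_free_mib` is hypothetical. The result is only an upper bound on what empty_cache() can return, since partially used segments stay reserved.

```python
from typing import Any


def cached_free_mib(cuda: Any) -> float:
    """Upper bound (in MiB) on memory held in the allocator's free pool.

    reserved  = bytes the caching allocator has obtained from the driver
    allocated = bytes currently occupied by live tensors
    """
    reserved = cuda.memory_reserved()
    allocated = cuda.memory_allocated()
    return (reserved - allocated) / (1024 * 1024)
```

Logging this value just before and just after the flush gives a rough per-process view that complements the device-wide nvidia-smi number.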

Test plan

Capture nvidia-smi memory.used delta with the empty_cache call across /release_memory_occupation — confirmed in the H100 E2E run.
Verify no perf regression on /resume_memory_occupation (allocator will rebuild the cache lazily on next forward) — resume latency stayed at 18 ms.
Single-process A/B test isolating just empty_cache's contribution (low priority; #274 is the bigger win in the stack).
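The first item in the test plan can be scripted roughly as follows. This is a sketch under stated assumptions, not the harness used for the H100 run: the helper names are hypothetical, and it assumes `nvidia-smi` is on the PATH and that the caller triggers /release_memory_occupation between the two readings.

```python
import subprocess
from typing import List


def parse_memory_used(csv_output: str) -> List[int]:
    """Parse one MiB value per GPU from nvidia-smi's csv,noheader,nounits output."""
    return [int(line.strip()) for line in csv_output.splitlines() if line.strip()]


def query_memory_used_mib() -> List[int]:
    """Read memory.used (MiB) for every visible GPU via nvidia-smi."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_memory_used(result.stdout)


def reclaim_percent(before_mib: int, after_mib: int, footprint_mib: int) -> float:
    """Share of the engine footprint that was returned to the driver."""
    return 100.0 * (before_mib - after_mib) / footprint_mib
```

Typical use would be to call `query_memory_used_mib()` once before and once after the release request, then pass the two readings and the engine footprint to `reclaim_percent`. Running the same measurement with and without the empty_cache() call would give the A/B delta mentioned in the last item.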

…ause torch_memory_saver.pause() releases the physical pages backing tensors in saver.region(), but PyTorch's caching allocator still holds onto its own free pool and any NCCL/IPC handles outside that region. From the driver's perspective those pages remain ours, so co-tenants can't use the headroom that pause() just freed. Call torch.cuda.empty_cache() + torch.cuda.ipc_collect() immediately after pause() to surrender those bytes as well. On a typical workload this recovers an additional few hundred MiB on top of saver.pause() alone (the exact number depends on allocator fragmentation at the time of release). Stacked on top of #272. Signed-off-by: Qingyang Wu <willqywu@gmail.com>

qywu marked this pull request as ready for review May 27, 2026 04:46

qywu requested a review from a team as a code owner May 27, 2026 04:46

qywu merged commit 33682b8 into feat/http-memory-occupation-endpoints May 27, 2026
2 checks passed

qywu deleted the feat/memory-saver-empty-cache branch May 27, 2026 04:47
