feat(memory-saver): flush PyTorch alloc cache and IPC handles after pause #273
Merged
qywu merged 1 commit on May 27, 2026
…ause

torch_memory_saver.pause() releases the physical pages backing tensors in saver.region(), but PyTorch's caching allocator still holds onto its own free pool and any NCCL/IPC handles outside that region. From the driver's perspective those pages remain ours, so co-tenants can't use the headroom that pause() just freed.

Call torch.cuda.empty_cache() + torch.cuda.ipc_collect() immediately after pause() to surrender those bytes as well. On a typical workload this recovers an additional few hundred MiB on top of saver.pause() alone (the exact number depends on allocator fragmentation at the time of release).

Stacked on top of #272.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
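The call order described in the commit message can be sketched as below. This is a minimal illustration, not the PR's actual diff: release_gpu_memory is a hypothetical helper name, and the pause callable and the cuda module are passed in as parameters (an assumption made here so the sketch can be exercised without a GPU).

```python
from typing import Any, Callable, List


def release_gpu_memory(pause: Callable[[], None], cuda: Any) -> List[str]:
    """Pause the saver, then flush allocator caches (hypothetical helper).

    `pause` is expected to behave like torch_memory_saver.pause() and
    `cuda` like the torch.cuda module; both are injected for testability.
    Returns the steps that ran, in order.
    """
    steps: List[str] = []

    # Release the physical pages backing tensors in saver.region().
    pause()
    steps.append("pause")

    if cuda.is_available():
        # Return the caching allocator's unused cached blocks to the driver.
        cuda.empty_cache()
        steps.append("empty_cache")

        # Collect GPU memory that was kept alive by released CUDA IPC handles.
        cuda.ipc_collect()
        steps.append("ipc_collect")

    return steps
```

In a real process the call would presumably look like release_gpu_memory(torch_memory_saver.pause, torch.cuda); the ordering matters because blocks only become eligible for release once pause() has run.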
Stacks on #272.
Summary
torch_memory_saver.pause() releases the physical pages backing tensors in saver.region(), but PyTorch's caching allocator still holds onto its own free pool and any NCCL/IPC handles outside that region. From the driver's perspective those pages are still ours, so co-tenants can't use the headroom pause() just freed.

This adds a torch.cuda.empty_cache() + torch.cuda.ipc_collect() immediately after pause() in the scheduler-side handler, so the bytes are actually surrendered to the driver pool.

Verification on H100
With #272 alone, the synthetic measurement showed 90.7% reclaim of the engine footprint. With this PR + #274 stacked on #272, the full-stack reclaim on a real engine (Qwen2-1.5B, 40,804 MiB footprint, --enforce-eager) is 97.7%; see the #272 description for the per-phase table.

The isolated contribution of this PR can't be precisely separated from #274 in the stacked test (both ran together). What can be said:

- Without empty_cache() after pause(), the PyTorch caching allocator continues to hold the freed blocks. On a steady-state engine that's typically a few hundred MiB; right after a burst that ended in free → realloc, it can be > 1 GiB.
- ipc_collect() adds a small but non-zero amount (NCCL collectives, P2P buffers).
- empty_cache() is essentially free at pause time.

Test plan
- nvidia-smi memory.used delta with the empty_cache call across /release_memory_occupation: confirmed in the H100 E2E run.
- /resume_memory_occupation (allocator will rebuild the cache lazily on next forward): resume latency stayed at 18 ms.
- Isolated measurement of empty_cache's contribution (low priority; #274, "feat(memory-saver): wrap CUDA graphs and attention workspaces in saver.region()", is the bigger win in the stack).
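The memory.used delta check in the test plan could be scripted roughly as follows. This is a sketch under stated assumptions, not the harness used for the H100 run: the function names are hypothetical, and reclaim_percent assumes "reclaim" means the drop in memory.used divided by the engine footprint.

```python
import subprocess
from typing import List


def parse_memory_used(output: str) -> List[int]:
    """Parse one integer MiB value per GPU from nvidia-smi CSV output."""
    return [int(line.strip()) for line in output.splitlines() if line.strip()]


def query_memory_used_mib() -> List[int]:
    """Return memory.used in MiB for each visible GPU (needs nvidia-smi)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_memory_used(result.stdout)


def reclaim_percent(before_mib: int, after_mib: int, footprint_mib: int) -> float:
    """Share of the engine footprint returned to the driver, in percent."""
    if footprint_mib <= 0:
        raise ValueError("footprint_mib must be positive")
    return 100.0 * (before_mib - after_mib) / footprint_mib
```

Calling query_memory_used_mib() once before and once after /release_memory_occupation, then passing the two readings and the engine footprint to reclaim_percent, gives a figure comparable to the percentages quoted above.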