
Add parameter to control whether to record stream #9

Open

RichardWooSJTU wants to merge 2 commits into PFCCLab:paddle from RichardWooSJTU:num_worst_tokens

Conversation


@RichardWooSJTU RichardWooSJTU commented Feb 28, 2026

Summary

- Add a `skip_x_record_stream` parameter (default `false`) to `intranode_dispatch` and `intranode_combine`, allowing callers to skip `record_stream` on large activation tensors (`x`, `recv_x`) to reduce GPU memory pressure.
- When `skip_x_record_stream=True`, PyTorch's CUDA caching allocator can reclaim the memory of `x`/`recv_x` earlier, instead of holding it until the communication stream finishes.

Motivation

record_stream prevents the CUDA caching allocator from reusing a tensor's memory until the recorded stream completes. For large activation tensors like x and recv_x, this significantly increases peak GPU memory usage. In scenarios where the caller can guarantee the lifetime of these tensors externally (e.g., the tensors are kept alive by Python references until the comm stream finishes), skipping record_stream is safe and reduces memory footprint.
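The effect described above can be modelled with a small toy example. This is a deliberately simplified sketch: `ToyAllocator` is not the real PyTorch caching allocator, it only mimics the rule that a block whose freeing was recorded against an unfinished stream cannot be reused until that stream completes.

```python
# Toy model of a stream-aware caching allocator (illustrative only, not
# the PyTorch allocator). A block "recorded" on a stream is held back
# from reuse until that stream finishes -- which is why record_stream
# on large tensors raises peak memory.

class ToyAllocator:
    def __init__(self):
        self.free_blocks = []   # blocks available for reuse
        self.pending = []       # (block, stream) pairs held by record_stream

    def allocate(self, size):
        # Reuse a free block if one fits, otherwise "allocate" a new one.
        for blk in self.free_blocks:
            if blk["size"] >= size:
                self.free_blocks.remove(blk)
                return blk
        return {"size": size}

    def free(self, block, recorded_stream=None):
        if recorded_stream is not None and not recorded_stream["done"]:
            # record_stream semantics: hold the block until the stream is done.
            self.pending.append((block, recorded_stream))
        else:
            self.free_blocks.append(block)

    def poll(self):
        # Release blocks whose recorded stream has since completed.
        still_pending = []
        for block, stream in self.pending:
            if stream["done"]:
                self.free_blocks.append(block)
            else:
                still_pending.append((block, stream))
        self.pending = still_pending


alloc = ToyAllocator()
comm_stream = {"done": False}

x = alloc.allocate(1024)
alloc.free(x, recorded_stream=comm_stream)  # freed, but recorded on comm stream
y = alloc.allocate(1024)                    # cannot reuse x's block yet
assert y is not x                           # peak memory: two live blocks

comm_stream["done"] = True
alloc.poll()
z = alloc.allocate(1024)                    # now x's block is reusable
assert z is x
```

With `skip_x_record_stream=True`, the freed block would go straight to `free_blocks` in this model, so `y` could reuse `x`'s memory immediately; the caller then carries the burden of keeping `x` alive until the communication stream finishes.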

Changes

| File | Change |
| --- | --- |
| `csrc/deep_ep.hpp` | Add `bool skip_x_record_stream = false` to `intranode_dispatch` and `intranode_combine` declarations |
| `csrc/deep_ep.cpp` | Gate `x`/`recv_x` `record_stream` calls behind `!skip_x_record_stream` in both functions |
| `deep_ep/buffer.py` | Add `skip_x_record_stream: bool = False` to `dispatch()` and `combine()`, pass through to the C++ runtime |
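The `buffer.py` change amounts to threading one keyword through to the extension. A minimal sketch of that pass-through, using a stub in place of the real C++ runtime (actual deep_ep signatures may differ):

```python
# Hedged sketch of the Python-side change: dispatch() gains a
# skip_x_record_stream keyword (default False) and forwards it to the
# runtime. _StubRuntime stands in for the real deep_ep C++ extension.

class _StubRuntime:
    def intranode_dispatch(self, x, skip_x_record_stream):
        # The real runtime would gate x.record_stream(comm_stream)
        # behind `not skip_x_record_stream`.
        return {"recv_x": x, "skipped_record_stream": skip_x_record_stream}


class Buffer:
    def __init__(self):
        self.runtime = _StubRuntime()

    def dispatch(self, x, skip_x_record_stream: bool = False):
        # Default False keeps existing behavior unchanged.
        return self.runtime.intranode_dispatch(
            x, skip_x_record_stream=skip_x_record_stream)


buf = Buffer()
assert buf.dispatch([1.0])["skipped_record_stream"] is False
assert buf.dispatch([1.0], skip_x_record_stream=True)["skipped_record_stream"] is True
```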

Usage

```python
# Default behavior unchanged
recv_x, *_ = buffer.dispatch(x, ...)

# Skip record_stream on x/recv_x to save memory
recv_x, *_ = buffer.dispatch(x, ..., skip_x_record_stream=True)
recv_x, *_ = buffer.combine(x, handle, ..., skip_x_record_stream=True)
```
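Because skipping `record_stream` shifts lifetime responsibility to the caller, callers must keep `x` referenced until the communication stream has finished. One hypothetical pattern is to park `(event, tensor)` pairs in a list and drop them once the event reports completion; the `Event` class below is a stand-in for `torch.cuda.Event` (in real code the event would be recorded on the communication stream):

```python
# Hypothetical caller-side lifetime guarantee for skip_x_record_stream=True:
# keep a Python reference to x until the comm stream is known finished.
# Event is a toy stand-in for torch.cuda.Event.

class Event:
    def __init__(self):
        self._done = False

    def record(self):
        # Toy model: pretend the comm stream finished as soon as recorded.
        self._done = True

    def query(self):
        return self._done


inflight = []  # (event, tensor) pairs keeping tensors alive


def dispatch_no_record(x, event):
    # Hold x via `inflight` instead of relying on record_stream.
    inflight.append((event, x))
    return x  # stand-in for recv_x


ev = Event()
recv = dispatch_no_record([1.0, 2.0], ev)
ev.record()

# Drop references only after the comm stream has completed.
inflight[:] = [(e, t) for e, t in inflight if not e.query()]
assert inflight == []
```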

