Separate single/multinode and enable CTX #10
base: main
Conversation
The changes made in this PR include:
RichardChamberlain1 left a comment
Needs some general tidy-up.
I ran a single-node test and it hung on inter-node, see here:
https://ml-ci-internal.amd.com/job/DeepEP/job/Experimental/101/console
Has this been stress tested? And if so, what Docker image did you base it off?
The docker image is based on … I will continue to try running it on OCI.
internode::shmem_fence();
#else
internode::shmem_ctx_quiet(ctx);
if constexpr (multinode)
This is already inside the if constexpr (multinode) block at line 284.
if constexpr (multinode){ // does CTX depend on multinode?
    internode::shmem_long_atomic_add( rdma_recv_count + dst_expert_local_idx * num_ranks + rank, -num_tokens_sent - 1, dst_rank);
}else{
    rocshmem::rocshmem_long_p(rdma_recv_count + dst_expert_local_idx * num_ranks + rank, -num_tokens_sent - 1, dst_rank);
Should this be rocshmem_long_atomic_set()? long_p will not issue a store with release semantics, unlike rocshmem_atomic_set.
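For reference, a minimal sketch of the substitution being suggested, assuming the standard rocSHMEM typed atomic-set signature (rocshmem_long_atomic_set) and reusing the identifiers from the diff above:

// Sketch only: replace the plain put with an atomic set so the remote store
// is issued with the stronger (release-like) ordering discussed above.
rocshmem::rocshmem_long_atomic_set(rdma_recv_count + dst_expert_local_idx * num_ranks + rank,
                                   -num_tokens_sent - 1, dst_rank);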
__shared__ internode::shmem_ctx_t ctx;
internode::shmem_wg_ctx_create(&ctx);
if constexpr (multinode) {
    EP_DEVICE_ASSERT(internode::shmem_wg_ctx_create(&ctx) == 0);
Recommend fast-tracking this change, as otherwise unexplainable crashes are generated when incorrect run parameters are supplied in the environment.
Pull request overview
This PR aims to separate single-node vs multi-node behavior in the low-latency internode kernels and enable CTX for multinode cases using new template variables.
Changes:
- Adds multinode template specialization to dispatch/combine and gates CTX create/destroy and several communication paths on it.
- Updates several low-level load/atomic helper utilities to add int64-related overloads and a “relaxed” atomic add helper.
- Adjusts benchmarking defaults in test utilities and minor formatting/whitespace in tests/bindings.
Reviewed changes
Copilot reviewed 3 out of 5 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| tests/utils.py | Increases default Kineto iteration count for benchmark/profiling helper. |
| tests/test_low_latency.py | Minor formatting/whitespace change in the spawn call. |
| csrc/kernels/utils.cuh | Adds int64 overloads for acquire loads and introduces atomic_add_relaxed_global. |
| csrc/kernels/internode_ll.cu | Adds multinode template parameter, gates CTX usage, adjusts sync/wait logic, and updates ROCm warp-group settings. |
| csrc/deep_ep.cpp | Whitespace-only change at module close. |
__device__ __forceinline__ uint64_t ld_acquire_sys_global(const int64_t *ptr) {
    int64_t ret;
#ifdef USE_ROCM
    ret = __hip_atomic_load(ptr, __ATOMIC_ACQUIRE, __HIP_MEMORY_SCOPE_SYSTEM);
#else
    asm volatile("ld.acquire.sys.global.u64 %0, [%1];" : "=l"(ret) : "l"(ptr));
#endif
    return ret;
Copilot AI (Jan 31, 2026)
ld_acquire_sys_global(const int64_t*) returns uint64_t but stores into an int64_t and uses the .u64 load path. This should use a uint64_t (or change the function to return int64_t and use the signed variant consistently) to avoid sign/width mismatches and unexpected wrapping on return.
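A possible shape for a consistent fix, sketched here under the assumption that the signed 64-bit variant is kept throughout (this is not the committed change):

// Sketch: keep the pointer, local, and return type all int64_t, and use the
// signed 64-bit PTX load with the 64-bit "l" register constraint.
__device__ __forceinline__ int64_t ld_acquire_sys_global(const int64_t *ptr) {
    int64_t ret;
#ifdef USE_ROCM
    ret = __hip_atomic_load(ptr, __ATOMIC_ACQUIRE, __HIP_MEMORY_SCOPE_SYSTEM);
#else
    asm volatile("ld.acquire.sys.global.s64 %0, [%1];" : "=l"(ret) : "l"(ptr));
#endif
    return ret;
}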
#ifdef USE_ROCM
    ret = __hip_atomic_fetch_add(const_cast<int*> (ptr), value, __ATOMIC_RELAXED, __HIP_MEMORY_SCOPE_AGENT);
#else
    asm volatile("atom.add.release.gpu.global.s32 %0, [%1], %2;" : "=r"(ret) : "l"(ptr), "r"(value));
Copilot AI (Jan 31, 2026)
atomic_add_relaxed_global uses __ATOMIC_RELAXED on ROCm but emits atom.add.release... on the non-ROCm path. Either rename the function to atomic_add_release_global (and call that), or change the inline PTX to the relaxed variant so the implementation matches the name/intent.
| asm volatile("atom.add.release.gpu.global.s32 %0, [%1], %2;" : "=r"(ret) : "l"(ptr), "r"(value)); | |
| asm volatile("atom.add.relaxed.gpu.global.s32 %0, [%1], %2;" : "=r"(ret) : "l"(ptr), "r"(value)); |
    return ret;
}
//inter
//not used
Copilot AI (Jan 31, 2026)
The //not used comment above atomic_add_release_global is incorrect (this function is used in internode_ll.cu). Please update/remove the comment to avoid misleading future changes.
Suggested change: delete the //not used comment.
if ((phases & LOW_LATENCY_RECV_PHASE) == 0)
if constexpr (multinode)
#if defined(USE_ROCM) && !defined(ROCM_DISABLE_CTX) && defined(multinode)
internode::shmem_wg_ctx_destroy(&ctx);
#endif
return;
Copilot AI (Jan 31, 2026)
Same issue in combine: defined(multinode) in the preprocessor condition is always false because multinode is a template parameter. As a result, the early-return path won’t destroy the ctx even when it was created. Please remove the defined(multinode) check and gate with if constexpr (multinode) + !defined(ROCM_DISABLE_CTX) instead.
Suggested change:
if ((phases & LOW_LATENCY_RECV_PHASE) == 0) {
    if constexpr (multinode) {
#if defined(USE_ROCM) && !defined(ROCM_DISABLE_CTX)
        internode::shmem_wg_ctx_destroy(&ctx);
#endif
    }
    return;
}
def bench_kineto(fn, kernel_names, num_tests: int = 100, suppress_kineto_output: bool = False,
                 trace_path: Optional[str] = None, barrier_comm_profiling: bool = False):
Copilot AI (Jan 31, 2026)
Changing bench_kineto default num_tests from 30 to 100 will significantly increase the runtime of tests/test_low_latency.py (it calls bench_kineto twice per run, plus the optional barrier/GEMM work). Consider keeping the default smaller and letting callers override it, or make it configurable via an env var/CLI arg so CI runtimes don’t regress unexpectedly.
__device__ __forceinline__ int ld_acquire_global(const int64_t *ptr) {
    int64_t ret;
#ifdef USE_ROCM
    ret = __hip_atomic_load(ptr, __ATOMIC_ACQUIRE, __HIP_MEMORY_SCOPE_AGENT);
#else
    asm volatile("ld.acquire.gpu.global.s32 %0, [%1];" : "=r"(ret) : "l"(ptr));
Copilot AI (Jan 31, 2026)
ld_acquire_global(const int64_t* ptr) is internally doing a 32-bit acquire load (ld.acquire.gpu.global.s32) and returns int even though it loads into an int64_t. On ROCm it also returns int despite using an int64_t atomic load. This will truncate 64-bit flags/counters (e.g., rdma_recv_count / rdma_recv_flag) and can break synchronization. Make this overload return int64_t and use the correct 64-bit load instruction/constraints on CUDA (.s64/.u64 with =l).
Suggested change:
__device__ __forceinline__ int64_t ld_acquire_global(const int64_t *ptr) {
    int64_t ret;
#ifdef USE_ROCM
    ret = __hip_atomic_load(ptr, __ATOMIC_ACQUIRE, __HIP_MEMORY_SCOPE_AGENT);
#else
    asm volatile("ld.acquire.gpu.global.s64 %0, [%1];" : "=l"(ret) : "l"(ptr));
if ((phases & LOW_LATENCY_RECV_PHASE) == 0)
if constexpr (multinode)
#if defined(USE_ROCM) && !defined(ROCM_DISABLE_CTX) && defined(multinode)
internode::shmem_wg_ctx_destroy(&ctx);
#endif
return;
Copilot AI (Jan 31, 2026)
The early-return cleanup uses #if ... && defined(multinode) but multinode is a template parameter, not a preprocessor macro, so defined(multinode) is always false. This means shmem_wg_ctx_destroy(&ctx) will never run on this early-return path, potentially leaking/dangling CTX resources when CTX is enabled. Remove the defined(multinode) preprocessor check and rely on if constexpr (multinode) (and !defined(ROCM_DISABLE_CTX)), or otherwise ensure destroy always runs when a ctx was created.
Suggested change:
if ((phases & LOW_LATENCY_RECV_PHASE) == 0) {
#if defined(USE_ROCM) && !defined(ROCM_DISABLE_CTX)
    if constexpr (multinode) {
        internode::shmem_wg_ctx_destroy(&ctx);
    }
#endif
    return;
}
#if !defined(ROCM_DISABLE_CTX)
__shared__ internode::shmem_ctx_t ctx;
internode::shmem_wg_ctx_create(&ctx);
if constexpr (multinode) {
    EP_DEVICE_ASSERT(internode::shmem_wg_ctx_create(&ctx) == 0);
}
Copilot AI (Jan 31, 2026)
The PR motivation mentions enabling CTX for multinode, but this file unconditionally defines ROCM_DISABLE_CTX at the top, which compiles out all CTX paths (#if !defined(ROCM_DISABLE_CTX)). As a result, the new multinode CTX logic here won’t take effect. If CTX should be enabled for multinode, the ROCM_DISABLE_CTX define needs to be removed or made conditional so the shmem_wg_ctx_create/destroy code is actually compiled for that case.
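One hedged way to address this is sketched below; DEEP_EP_ENABLE_ROCM_CTX is a hypothetical opt-in build flag used only for illustration, not an existing define:

// Hypothetical sketch: only define ROCM_DISABLE_CTX when the opt-in flag is
// absent, so builds that want CTX keep the shmem_wg_ctx_create/destroy paths
// compiled in and the if constexpr (multinode) gating can actually take effect.
#if defined(USE_ROCM) && !defined(DEEP_EP_ENABLE_ROCM_CTX)
#define ROCM_DISABLE_CTX
#endif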
#endif
constexpr int kNumMaxTopK = 9;
EP_STATIC_ASSERT(kNumMaxTopK + 1 <= kNumWarpGroups * kNumWarpsPerGroup, "Too many top-k selections");
// EP_STATIC_ASSERT(kNumMaxTopK + 1 <= kNumWarpGroups * kNumWarpsPerGroup, "Too many top-k selections");
Copilot AI (Jan 31, 2026)
Commenting out this EP_STATIC_ASSERT removes a useful compile-time safety check that the configured warp budget can handle kNumMaxTopK. If there’s no known compiler issue with it, it should remain enabled (it currently appears to hold for the chosen constants). If it must stay disabled, add a short comment explaining why.
Suggested change:
EP_STATIC_ASSERT(kNumMaxTopK + 1 <= kNumWarpGroups * kNumWarpsPerGroup, "Too many top-k selections");
Motivation
This PR continues Amir's commit cde5f3d and enables CTX for multinode cases using the new template variables.
Submission Checklist