Conversation

@amd-wsung102
Contributor

Motivation

This PR continues Amir's commit cde5f3d and enables CTX for multinode cases using the new template variables.

Submission Checklist

@amd-wsung102
Contributor Author

The changes made in this PR include:

  • Building on the existing code changes in Amir's commit cde5f3d.
  • Enabling CTX in internode_ll.cu for multinode use cases. A case is treated as multinode when the newly added template parameter multinode is true; its value is derived from the number of ranks at runtime dispatch (see the sketch below).
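
A minimal sketch of the runtime-to-compile-time dispatch described above; every name except the multinode parameter is a hypothetical stand-in, not this PR's actual code:

#include <hip/hip_runtime.h>

template <bool multinode>
__global__ void dispatch_kernel() {
    if constexpr (multinode) {
        // CTX create/use/destroy paths are compiled only in this branch.
    }
}

// Hypothetical host-side launcher: choose the instantiation from the rank count.
void launch_dispatch(int num_ranks, int ranks_per_node) {
    const bool is_multinode = num_ranks > ranks_per_node;
    if (is_multinode)
        dispatch_kernel<true><<<1, 64>>>();
    else
        dispatch_kernel<false><<<1, 64>>>();
}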


@RichardChamberlain1 left a comment


Needs some general tidying up.

I ran the single-node test and it hung on inter-node; see here:

https://ml-ci-internal.amd.com/job/DeepEP/job/Experimental/101/console.

Has this been stress-tested? And if so, what Docker image did you base it on?

@amd-wsung102
Contributor Author

The Docker image is based on rocm6.3.4_ubuntu24.04_py3.12_pytorch_release_2.4.0. This has not been stress-tested. I am currently trying to stress-test it on OCI, but OCI is showing the error "permission denied while trying to connect to the docker API at unix:///var/run/docker.sock".

I will continue to try running it on OCI.

    internode::shmem_fence();
#else
    internode::shmem_ctx_quiet(ctx);
    if constexpr (multinode)
Contributor


This is already inside the if constexpr (multinode) block at line 284.

if constexpr (multinode) { // does CTX depend on multinode?
    internode::shmem_long_atomic_add(rdma_recv_count + dst_expert_local_idx * num_ranks + rank, -num_tokens_sent - 1, dst_rank);
} else {
    rocshmem::rocshmem_long_p(rdma_recv_count + dst_expert_local_idx * num_ranks + rank, -num_tokens_sent - 1, dst_rank);
Contributor


Should this be rocshmem_long_atomic_set()? long_p will not issue a store with release semantics, unlike rocshmem_atomic_set.
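
For illustration, a hedged sketch of the suggested replacement on the single-node branch, assuming rocshmem_long_atomic_set takes the same destination/value/PE arguments as rocshmem_long_p:

// Sketch only: atomic_set, which per the comment above is expected to
// carry release semantics, unlike long_p.
rocshmem::rocshmem_long_atomic_set(
    rdma_recv_count + dst_expert_local_idx * num_ranks + rank,  // remote flag slot
    -num_tokens_sent - 1,                                       // encoded token count
    dst_rank);                                                  // destination PE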

__shared__ internode::shmem_ctx_t ctx;
internode::shmem_wg_ctx_create(&ctx);
if constexpr (multinode) {
    EP_DEVICE_ASSERT(internode::shmem_wg_ctx_create(&ctx) == 0);
Contributor


Recommend fast-tracking this change, as otherwise unexplainable crashes are generated when incorrect run parameters are supplied in the environment.


Copilot AI left a comment


Pull request overview

This PR aims to separate single-node vs multi-node behavior in the low-latency internode kernels and enable CTX for multinode cases using new template variables.

Changes:

  • Adds multinode template specialization to dispatch/combine and gates CTX create/destroy and several communication paths on it.
  • Updates several low-level load/atomic helper utilities to add int64-related overloads and a “relaxed” atomic add helper.
  • Adjusts benchmarking defaults in test utilities and minor formatting/whitespace in tests/bindings.

Reviewed changes

Copilot reviewed 3 out of 5 changed files in this pull request and generated 9 comments.

Show a summary per file:

  • tests/utils.py: Increases the default Kineto iteration count for the benchmark/profiling helper.
  • tests/test_low_latency.py: Minor formatting/whitespace change in the spawn call.
  • csrc/kernels/utils.cuh: Adds int64 overloads for acquire loads and introduces atomic_add_relaxed_global.
  • csrc/kernels/internode_ll.cu: Adds the multinode template parameter, gates CTX usage, adjusts sync/wait logic, and updates ROCm warp-group settings.
  • csrc/deep_ep.cpp: Whitespace-only change at module close.


Comment on lines +224 to +231
__device__ __forceinline__ uint64_t ld_acquire_sys_global(const int64_t *ptr) {
    int64_t ret;
#ifdef USE_ROCM
    ret = __hip_atomic_load(ptr, __ATOMIC_ACQUIRE, __HIP_MEMORY_SCOPE_SYSTEM);
#else
    asm volatile("ld.acquire.sys.global.u64 %0, [%1];" : "=l"(ret) : "l"(ptr));
#endif
    return ret;

Copilot AI Jan 31, 2026


ld_acquire_sys_global(const int64_t*) returns uint64_t but stores into an int64_t and uses the .u64 load path. This should use a uint64_t (or change the function to return int64_t and use the signed variant consistently) to avoid sign/width mismatches and unexpected wrapping on return.
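
For illustration, a hedged sketch of the second option described above (return int64_t and use the signed 64-bit load consistently); this is a sketch, not the PR's code:

__device__ __forceinline__ int64_t ld_acquire_sys_global(const int64_t *ptr) {
    int64_t ret;
#ifdef USE_ROCM
    ret = __hip_atomic_load(ptr, __ATOMIC_ACQUIRE, __HIP_MEMORY_SCOPE_SYSTEM);
#else
    // Signed 64-bit PTX load so the width matches the return type.
    asm volatile("ld.acquire.sys.global.s64 %0, [%1];" : "=l"(ret) : "l"(ptr));
#endif
    return ret;
}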

#ifdef USE_ROCM
    ret = __hip_atomic_fetch_add(const_cast<int*>(ptr), value, __ATOMIC_RELAXED, __HIP_MEMORY_SCOPE_AGENT);
#else
    asm volatile("atom.add.release.gpu.global.s32 %0, [%1], %2;" : "=r"(ret) : "l"(ptr), "r"(value));

Copilot AI Jan 31, 2026


atomic_add_relaxed_global uses __ATOMIC_RELAXED on ROCm but emits atom.add.release... on the non-ROCm path. Either rename the function to atomic_add_release_global (and call that), or change the inline PTX to the relaxed variant so the implementation matches the name/intent.

Suggested change
asm volatile("atom.add.release.gpu.global.s32 %0, [%1], %2;" : "=r"(ret) : "l"(ptr), "r"(value));
asm volatile("atom.add.relaxed.gpu.global.s32 %0, [%1], %2;" : "=r"(ret) : "l"(ptr), "r"(value));

return ret;
}
//inter
//not used

Copilot AI Jan 31, 2026


The //not used comment above atomic_add_release_global is incorrect (this function is used in internode_ll.cu). Please update/remove the comment to avoid misleading future changes.

Suggested change
//not used

Comment on lines 620 to 625
if ((phases & LOW_LATENCY_RECV_PHASE) == 0)
if constexpr (multinode)
#if defined(USE_ROCM) && !defined(ROCM_DISABLE_CTX) && defined(multinode)
internode::shmem_wg_ctx_destroy(&ctx);
#endif
return;

Copilot AI Jan 31, 2026


Same issue in combine: defined(multinode) in the preprocessor condition is always false because multinode is a template parameter. As a result, the early-return path won’t destroy the ctx even when it was created. Please remove the defined(multinode) check and gate with if constexpr (multinode) + !defined(ROCM_DISABLE_CTX) instead.

Suggested change
if ((phases & LOW_LATENCY_RECV_PHASE) == 0)
if constexpr (multinode)
#if defined(USE_ROCM) && !defined(ROCM_DISABLE_CTX) && defined(multinode)
internode::shmem_wg_ctx_destroy(&ctx);
#endif
return;
if ((phases & LOW_LATENCY_RECV_PHASE) == 0) {
if constexpr (multinode) {
#if defined(USE_ROCM) && !defined(ROCM_DISABLE_CTX)
internode::shmem_wg_ctx_destroy(&ctx);
#endif
}
return;
}

Comment on lines +146 to 147
def bench_kineto(fn, kernel_names, num_tests: int = 100, suppress_kineto_output: bool = False,
trace_path: Optional[str] = None, barrier_comm_profiling: bool = False):

Copilot AI Jan 31, 2026


Changing bench_kineto default num_tests from 30 to 100 will significantly increase the runtime of tests/test_low_latency.py (it calls bench_kineto twice per run, plus the optional barrier/GEMM work). Consider keeping the default smaller and letting callers override it, or make it configurable via an env var/CLI arg so CI runtimes don’t regress unexpectedly.

Comment on lines +243 to +248
__device__ __forceinline__ int ld_acquire_global(const int64_t *ptr) {
    int64_t ret;
#ifdef USE_ROCM
    ret = __hip_atomic_load(ptr, __ATOMIC_ACQUIRE, __HIP_MEMORY_SCOPE_AGENT);
#else
    asm volatile("ld.acquire.gpu.global.s32 %0, [%1];" : "=r"(ret) : "l"(ptr));

Copilot AI Jan 31, 2026


ld_acquire_global(const int64_t* ptr) is internally doing a 32-bit acquire load (ld.acquire.gpu.global.s32) and returns int even though it loads into an int64_t. On ROCm it also returns int despite using an int64_t atomic load. This will truncate 64-bit flags/counters (e.g., rdma_recv_count / rdma_recv_flag) and can break synchronization. Make this overload return int64_t and use the correct 64-bit load instruction/constraints on CUDA (.s64/.u64 with =l).

Suggested change
__device__ __forceinline__ int ld_acquire_global(const int64_t *ptr) {
int64_t ret;
#ifdef USE_ROCM
ret = __hip_atomic_load(ptr, __ATOMIC_ACQUIRE, __HIP_MEMORY_SCOPE_AGENT);
#else
asm volatile("ld.acquire.gpu.global.s32 %0, [%1];" : "=r"(ret) : "l"(ptr));
__device__ __forceinline__ int64_t ld_acquire_global(const int64_t *ptr) {
int64_t ret;
#ifdef USE_ROCM
ret = __hip_atomic_load(ptr, __ATOMIC_ACQUIRE, __HIP_MEMORY_SCOPE_AGENT);
#else
asm volatile("ld.acquire.gpu.global.s64 %0, [%1];" : "=l"(ret) : "l"(ptr));

Comment on lines 335 to 340
if ((phases & LOW_LATENCY_RECV_PHASE) == 0)
if constexpr (multinode)
#if defined(USE_ROCM) && !defined(ROCM_DISABLE_CTX) && defined(multinode)
internode::shmem_wg_ctx_destroy(&ctx);
#endif
return;

Copilot AI Jan 31, 2026


The early-return cleanup uses #if ... && defined(multinode) but multinode is a template parameter, not a preprocessor macro, so defined(multinode) is always false. This means shmem_wg_ctx_destroy(&ctx) will never run on this early-return path, potentially leaking/dangling CTX resources when CTX is enabled. Remove the defined(multinode) preprocessor check and rely on if constexpr (multinode) (and !defined(ROCM_DISABLE_CTX)), or otherwise ensure destroy always runs when a ctx was created.

Suggested change
if ((phases & LOW_LATENCY_RECV_PHASE) == 0)
if constexpr (multinode)
#if defined(USE_ROCM) && !defined(ROCM_DISABLE_CTX) && defined(multinode)
internode::shmem_wg_ctx_destroy(&ctx);
#endif
return;
if ((phases & LOW_LATENCY_RECV_PHASE) == 0) {
#if defined(USE_ROCM) && !defined(ROCM_DISABLE_CTX)
if constexpr (multinode) {
internode::shmem_wg_ctx_destroy(&ctx);
}
#endif
return;
}

Comment on lines 99 to +103
#if !defined(ROCM_DISABLE_CTX)
    __shared__ internode::shmem_ctx_t ctx;
    internode::shmem_wg_ctx_create(&ctx);
    if constexpr (multinode) {
        EP_DEVICE_ASSERT(internode::shmem_wg_ctx_create(&ctx) == 0);
    }

Copilot AI Jan 31, 2026


The PR motivation mentions enabling CTX for multinode, but this file unconditionally defines ROCM_DISABLE_CTX at the top, which compiles out all CTX paths (#if !defined(ROCM_DISABLE_CTX)). As a result, the new multinode CTX logic here won’t take effect. If CTX should be enabled for multinode, the ROCM_DISABLE_CTX define needs to be removed or made conditional so the shmem_wg_ctx_create/destroy code is actually compiled for that case.
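
A minimal sketch of how the create path could be gated consistently with the destroy paths suggested above, assuming the unconditional ROCM_DISABLE_CTX define is removed from the top of the file so the guard can actually take effect:

// Sketch only: assumes ROCM_DISABLE_CTX is no longer force-defined in this
// file, so builds opt out via -DROCM_DISABLE_CTX instead.
#if defined(USE_ROCM) && !defined(ROCM_DISABLE_CTX)
    __shared__ internode::shmem_ctx_t ctx;
    if constexpr (multinode) {
        // Create the per-workgroup CTX only on the multinode path.
        EP_DEVICE_ASSERT(internode::shmem_wg_ctx_create(&ctx) == 0);
    }
#endif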

#endif
constexpr int kNumMaxTopK = 9;
EP_STATIC_ASSERT(kNumMaxTopK + 1 <= kNumWarpGroups * kNumWarpsPerGroup, "Too many top-k selections");
// EP_STATIC_ASSERT(kNumMaxTopK + 1 <= kNumWarpGroups * kNumWarpsPerGroup, "Too many top-k selections");

Copilot AI Jan 31, 2026


Commenting out this EP_STATIC_ASSERT removes a useful compile-time safety check that the configured warp budget can handle kNumMaxTopK. If there’s no known compiler issue with it, it should remain enabled (it currently appears to hold for the chosen constants). If it must stay disabled, add a short comment explaining why.

Suggested change
// EP_STATIC_ASSERT(kNumMaxTopK + 1 <= kNumWarpGroups * kNumWarpsPerGroup, "Too many top-k selections");
EP_STATIC_ASSERT(kNumMaxTopK + 1 <= kNumWarpGroups * kNumWarpsPerGroup, "Too many top-k selections");

