Conversation

@amd-wsung102
Contributor

Motivation

This PR continues Amir's commit cde5f3d and enables CTX for multinode cases using the new template variables.

Submission Checklist

@amd-wsung102
Contributor Author

The changes made in this PR include:

  • Building on the existing code changes in Amir's commit cde5f3d.
  • Enabling CTX in internode_ll.cu for multinode use cases. A case is treated as multinode when the newly added template parameter multinode is true; its value is derived from the number of ranks at runtime dispatch (see the sketch below).
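
A minimal sketch of the runtime-to-compile-time dispatch described above; every name except the multinode parameter is a hypothetical stand-in, not this PR's actual code:

#include <hip/hip_runtime.h>

template <bool multinode>
__global__ void dispatch_kernel() {
    if constexpr (multinode) {
        // CTX create/use/destroy paths are compiled only in this branch.
    }
}

// Hypothetical host-side launcher: choose the instantiation from the rank count.
void launch_dispatch(int num_ranks, int ranks_per_node) {
    const bool is_multinode = num_ranks > ranks_per_node;
    if (is_multinode)
        dispatch_kernel<true><<<1, 64>>>();
    else
        dispatch_kernel<false><<<1, 64>>>();
}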


@RichardChamberlain1 left a comment


Needs some general tidying up.

I ran the single-node test and it hung on inter-node; see here:

https://ml-ci-internal.amd.com/job/DeepEP/job/Experimental/101/console.

Has this been stress-tested? And if so, what Docker image did you base it on?

@amd-wsung102
Contributor Author

The Docker image is based on rocm6.3.4_ubuntu24.04_py3.12_pytorch_release_2.4.0. This has not been stress-tested. I am currently trying to stress-test it on OCI, but OCI is showing the error "permission denied while trying to connect to the docker API at unix:///var/run/docker.sock".

I will continue to try running it on OCI.

    internode::shmem_fence();
#else
    internode::shmem_ctx_quiet(ctx);
    if constexpr (multinode)
Contributor


This is already inside the if constexpr (multinode) block at line 284.

if constexpr (multinode) { // does CTX depend on multinode?
    internode::shmem_long_atomic_add(rdma_recv_count + dst_expert_local_idx * num_ranks + rank, -num_tokens_sent - 1, dst_rank);
} else {
    rocshmem::rocshmem_long_p(rdma_recv_count + dst_expert_local_idx * num_ranks + rank, -num_tokens_sent - 1, dst_rank);
Contributor


Should this be rocshmem_long_atomic_set()? long_p will not issue a store with release semantics, unlike rocshmem_atomic_set.
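
For illustration, a hedged sketch of the suggested replacement on the single-node branch, assuming rocshmem_long_atomic_set takes the same destination/value/PE arguments as rocshmem_long_p:

// Sketch only: atomic_set, which per the comment above is expected to
// carry release semantics, unlike long_p.
rocshmem::rocshmem_long_atomic_set(
    rdma_recv_count + dst_expert_local_idx * num_ranks + rank,  // remote flag slot
    -num_tokens_sent - 1,                                       // encoded token count
    dst_rank);                                                  // destination PE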

__shared__ internode::shmem_ctx_t ctx;
internode::shmem_wg_ctx_create(&ctx);
if constexpr (multinode) {
    EP_DEVICE_ASSERT(internode::shmem_wg_ctx_create(&ctx) == 0);
Contributor


Recommend fast-tracking this change, as otherwise unexplainable crashes are generated when incorrect run parameters are supplied in the environment.


Copilot AI left a comment


Pull request overview

This PR aims to separate single-node vs multi-node behavior in the low-latency internode kernels and enable CTX for multinode cases using new template variables.

Changes:

  • Adds multinode template specialization to dispatch/combine and gates CTX create/destroy and several communication paths on it.
  • Updates several low-level load/atomic helper utilities to add int64-related overloads and a “relaxed” atomic add helper.
  • Adjusts benchmarking defaults in test utilities and minor formatting/whitespace in tests/bindings.

Reviewed changes

Copilot reviewed 3 out of 5 changed files in this pull request and generated 9 comments.

Show a summary per file:

  • tests/utils.py: Increases the default Kineto iteration count for the benchmark/profiling helper.
  • tests/test_low_latency.py: Minor formatting/whitespace change in the spawn call.
  • csrc/kernels/utils.cuh: Adds int64 overloads for acquire loads and introduces atomic_add_relaxed_global.
  • csrc/kernels/internode_ll.cu: Adds the multinode template parameter, gates CTX usage, adjusts sync/wait logic, and updates ROCm warp-group settings.
  • csrc/deep_ep.cpp: Whitespace-only change at module close.


Comment on lines +224 to +231
__device__ __forceinline__ uint64_t ld_acquire_sys_global(const int64_t *ptr) {
    int64_t ret;
#ifdef USE_ROCM
    ret = __hip_atomic_load(ptr, __ATOMIC_ACQUIRE, __HIP_MEMORY_SCOPE_SYSTEM);
#else
    asm volatile("ld.acquire.sys.global.u64 %0, [%1];" : "=l"(ret) : "l"(ptr));
#endif
    return ret;

Copilot AI Jan 31, 2026


ld_acquire_sys_global(const int64_t*) returns uint64_t but stores into an int64_t and uses the .u64 load path. This should use a uint64_t (or change the function to return int64_t and use the signed variant consistently) to avoid sign/width mismatches and unexpected wrapping on return.
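
For illustration, a hedged sketch of the second option described above (return int64_t and use the signed 64-bit load consistently); this is a sketch, not the PR's code:

__device__ __forceinline__ int64_t ld_acquire_sys_global(const int64_t *ptr) {
    int64_t ret;
#ifdef USE_ROCM
    ret = __hip_atomic_load(ptr, __ATOMIC_ACQUIRE, __HIP_MEMORY_SCOPE_SYSTEM);
#else
    // Signed 64-bit PTX load so the width matches the return type.
    asm volatile("ld.acquire.sys.global.s64 %0, [%1];" : "=l"(ret) : "l"(ptr));
#endif
    return ret;
}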

#ifdef USE_ROCM
    ret = __hip_atomic_fetch_add(const_cast<int*>(ptr), value, __ATOMIC_RELAXED, __HIP_MEMORY_SCOPE_AGENT);
#else
    asm volatile("atom.add.release.gpu.global.s32 %0, [%1], %2;" : "=r"(ret) : "l"(ptr), "r"(value));

Copilot AI Jan 31, 2026


atomic_add_relaxed_global uses __ATOMIC_RELAXED on ROCm but emits atom.add.release... on the non-ROCm path. Either rename the function to atomic_add_release_global (and call that), or change the inline PTX to the relaxed variant so the implementation matches the name/intent.

Suggested change
asm volatile("atom.add.release.gpu.global.s32 %0, [%1], %2;" : "=r"(ret) : "l"(ptr), "r"(value));
asm volatile("atom.add.relaxed.gpu.global.s32 %0, [%1], %2;" : "=r"(ret) : "l"(ptr), "r"(value));

return ret;
}
//inter
//not used

Copilot AI Jan 31, 2026


The //not used comment above atomic_add_release_global is incorrect (this function is used in internode_ll.cu). Please update/remove the comment to avoid misleading future changes.

Suggested change
//not used

Comment on lines 620 to 625
if ((phases & LOW_LATENCY_RECV_PHASE) == 0)
if constexpr (multinode)
#if defined(USE_ROCM) && !defined(ROCM_DISABLE_CTX) && defined(multinode)
internode::shmem_wg_ctx_destroy(&ctx);
#endif
return;

Copilot AI Jan 31, 2026


Same issue in combine: defined(multinode) in the preprocessor condition is always false because multinode is a template parameter. As a result, the early-return path won’t destroy the ctx even when it was created. Please remove the defined(multinode) check and gate with if constexpr (multinode) + !defined(ROCM_DISABLE_CTX) instead.

Suggested change
if ((phases & LOW_LATENCY_RECV_PHASE) == 0)
if constexpr (multinode)
#if defined(USE_ROCM) && !defined(ROCM_DISABLE_CTX) && defined(multinode)
internode::shmem_wg_ctx_destroy(&ctx);
#endif
return;
if ((phases & LOW_LATENCY_RECV_PHASE) == 0) {
if constexpr (multinode) {
#if defined(USE_ROCM) && !defined(ROCM_DISABLE_CTX)
internode::shmem_wg_ctx_destroy(&ctx);
#endif
}
return;
}

Comment on lines +146 to 147
def bench_kineto(fn, kernel_names, num_tests: int = 100, suppress_kineto_output: bool = False,
trace_path: Optional[str] = None, barrier_comm_profiling: bool = False):

Copilot AI Jan 31, 2026


Changing bench_kineto default num_tests from 30 to 100 will significantly increase the runtime of tests/test_low_latency.py (it calls bench_kineto twice per run, plus the optional barrier/GEMM work). Consider keeping the default smaller and letting callers override it, or make it configurable via an env var/CLI arg so CI runtimes don’t regress unexpectedly.

Comment on lines +243 to +248
__device__ __forceinline__ int ld_acquire_global(const int64_t *ptr) {
    int64_t ret;
#ifdef USE_ROCM
    ret = __hip_atomic_load(ptr, __ATOMIC_ACQUIRE, __HIP_MEMORY_SCOPE_AGENT);
#else
    asm volatile("ld.acquire.gpu.global.s32 %0, [%1];" : "=r"(ret) : "l"(ptr));

Copilot AI Jan 31, 2026


ld_acquire_global(const int64_t* ptr) is internally doing a 32-bit acquire load (ld.acquire.gpu.global.s32) and returns int even though it loads into an int64_t. On ROCm it also returns int despite using an int64_t atomic load. This will truncate 64-bit flags/counters (e.g., rdma_recv_count / rdma_recv_flag) and can break synchronization. Make this overload return int64_t and use the correct 64-bit load instruction/constraints on CUDA (.s64/.u64 with =l).

Suggested change
__device__ __forceinline__ int ld_acquire_global(const int64_t *ptr) {
int64_t ret;
#ifdef USE_ROCM
ret = __hip_atomic_load(ptr, __ATOMIC_ACQUIRE, __HIP_MEMORY_SCOPE_AGENT);
#else
asm volatile("ld.acquire.gpu.global.s32 %0, [%1];" : "=r"(ret) : "l"(ptr));
__device__ __forceinline__ int64_t ld_acquire_global(const int64_t *ptr) {
int64_t ret;
#ifdef USE_ROCM
ret = __hip_atomic_load(ptr, __ATOMIC_ACQUIRE, __HIP_MEMORY_SCOPE_AGENT);
#else
asm volatile("ld.acquire.gpu.global.s64 %0, [%1];" : "=l"(ret) : "l"(ptr));

Comment on lines 335 to 340
if ((phases & LOW_LATENCY_RECV_PHASE) == 0)
if constexpr (multinode)
#if defined(USE_ROCM) && !defined(ROCM_DISABLE_CTX) && defined(multinode)
internode::shmem_wg_ctx_destroy(&ctx);
#endif
return;

Copilot AI Jan 31, 2026


The early-return cleanup uses #if ... && defined(multinode) but multinode is a template parameter, not a preprocessor macro, so defined(multinode) is always false. This means shmem_wg_ctx_destroy(&ctx) will never run on this early-return path, potentially leaking/dangling CTX resources when CTX is enabled. Remove the defined(multinode) preprocessor check and rely on if constexpr (multinode) (and !defined(ROCM_DISABLE_CTX)), or otherwise ensure destroy always runs when a ctx was created.

Suggested change
if ((phases & LOW_LATENCY_RECV_PHASE) == 0)
if constexpr (multinode)
#if defined(USE_ROCM) && !defined(ROCM_DISABLE_CTX) && defined(multinode)
internode::shmem_wg_ctx_destroy(&ctx);
#endif
return;
if ((phases & LOW_LATENCY_RECV_PHASE) == 0) {
#if defined(USE_ROCM) && !defined(ROCM_DISABLE_CTX)
if constexpr (multinode) {
internode::shmem_wg_ctx_destroy(&ctx);
}
#endif
return;
}

Comment on lines 99 to +103
#if !defined(ROCM_DISABLE_CTX)
    __shared__ internode::shmem_ctx_t ctx;
    internode::shmem_wg_ctx_create(&ctx);
    if constexpr (multinode) {
        EP_DEVICE_ASSERT(internode::shmem_wg_ctx_create(&ctx) == 0);
    }

Copilot AI Jan 31, 2026


The PR motivation mentions enabling CTX for multinode, but this file unconditionally defines ROCM_DISABLE_CTX at the top, which compiles out all CTX paths (#if !defined(ROCM_DISABLE_CTX)). As a result, the new multinode CTX logic here won’t take effect. If CTX should be enabled for multinode, the ROCM_DISABLE_CTX define needs to be removed or made conditional so the shmem_wg_ctx_create/destroy code is actually compiled for that case.
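
A minimal sketch of how the create path could be gated consistently with the destroy paths suggested above, assuming the unconditional ROCM_DISABLE_CTX define is removed from the top of the file so the guard can actually take effect:

// Sketch only: assumes ROCM_DISABLE_CTX is no longer force-defined in this
// file, so builds opt out via -DROCM_DISABLE_CTX instead.
#if defined(USE_ROCM) && !defined(ROCM_DISABLE_CTX)
    __shared__ internode::shmem_ctx_t ctx;
    if constexpr (multinode) {
        // Create the per-workgroup CTX only on the multinode path.
        EP_DEVICE_ASSERT(internode::shmem_wg_ctx_create(&ctx) == 0);
    }
#endif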

#endif
constexpr int kNumMaxTopK = 9;
EP_STATIC_ASSERT(kNumMaxTopK + 1 <= kNumWarpGroups * kNumWarpsPerGroup, "Too many top-k selections");
// EP_STATIC_ASSERT(kNumMaxTopK + 1 <= kNumWarpGroups * kNumWarpsPerGroup, "Too many top-k selections");

Copilot AI Jan 31, 2026


Commenting out this EP_STATIC_ASSERT removes a useful compile-time safety check that the configured warp budget can handle kNumMaxTopK. If there’s no known compiler issue with it, it should remain enabled (it currently appears to hold for the chosen constants). If it must stay disabled, add a short comment explaining why.

Suggested change
// EP_STATIC_ASSERT(kNumMaxTopK + 1 <= kNumWarpGroups * kNumWarpsPerGroup, "Too many top-k selections");
EP_STATIC_ASSERT(kNumMaxTopK + 1 <= kNumWarpGroups * kNumWarpsPerGroup, "Too many top-k selections");

