Unify MoE test perf timing to CUDA Event bench and fix minor issues#408
- Remove the `run_perftest` path from the mxscale/wmma test harnesses, keeping only `_bench_kernel_us` (CUDA Event per-iteration timing with L2 flush and IQR outlier filtering) for consistent and accurate benchmarking.
- Remove the `test_graph`/`benchmark_mode` parameters and the CUDA Graph parametrize from the pytest tests, since they were only used by the removed perftest path.
- Clean up the `__main__` CLI in the wmma test to remove the `--test_graph`/`--benchmark` args.
- Fix `is_rdna_arch()` to match `gfx12*` (not just `gfx120*`) so gfx1250 is detected.
- Fix the `bench_resolve_tiles()` `tile_k` fallback for non-divisible dimensions.

Made-with: Cursor
Pull request overview
This PR standardizes MoE 2-stage kernel timing in the gfx1250 test harnesses by removing the legacy run_perftest/CUDA-graph benchmarking path and using a single CUDA-event-based per-iteration timer, while also fixing a couple of small architecture/tiling utilities.
Changes:
- Remove the `run_perftest`/`test_graph`/`benchmark_mode` paths from the WMMA and MXScale MoE gfx1250 harnesses and always time with `bench_kernel_us`.
- Update tile resolution logic in `bench_resolve_tiles` and broaden RDNA arch detection from `gfx120*` to `gfx12*`.
- Simplify the WMMA harness CLI by removing the obsolete `--benchmark`/`--test_graph` options.
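The PR's single retained timing path combines CUDA-event per-iteration timing with IQR outlier filtering before taking the median. The filtering step can be sketched in pure Python as follows (a hypothetical sketch; the real `_bench_kernel_us` helper additionally drives CUDA events and the L2 flush, and its exact quantile method may differ):

```python
import statistics

def iqr_filter_median_us(samples_us):
    """Median latency after dropping 1.5*IQR outliers, as IQR-filtered
    benchmark timers typically do. Illustrative sketch only."""
    if len(samples_us) < 4:
        # Too few samples to estimate quartiles reliably.
        return statistics.median(samples_us)
    s = sorted(samples_us)
    q1, _, q3 = statistics.quantiles(s, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Keep in-fence samples; fall back to all samples if everything is cut.
    kept = [x for x in s if lo <= x <= hi] or s
    return statistics.median(kept)
```

A one-off warmup spike (e.g. a first-iteration 100 µs among 10 µs steady-state samples) is discarded before the median is taken, which is why per-iteration timing plus filtering is more stable than a single wall-clock average.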
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tests/kernels/test_moe_gemm_wmma_gfx1250.py | Removes graph/perftest timing paths and CLI args; uses CUDA-event benchmark timer consistently. |
| tests/kernels/test_moe_gemm_mxscale_gfx1250.py | Removes graph/perftest timing paths; uses CUDA-event benchmark timer consistently. |
| tests/kernels/benchmark_common.py | Adjusts tile selection behavior for stage2 tile_k resolution. |
| python/flydsl/runtime/device.py | Updates RDNA arch detection to treat all gfx12* as RDNA. |
```python
parser.add_argument("--benchmark", action="store_true", default=False,
                    help="Use GEMM-style per-iteration event timing with optional L2 flush.")
parser.add_argument("--no_flush_l2", action="store_true", default=False,
                    help="Disable L2 flush in benchmark mode.")
```
The --benchmark flag was removed, but the --no_flush_l2 help text still refers to “benchmark mode”. Since timing is now always done via the CUDA-event benchmark path, consider updating the help text to reflect the current behavior (e.g., “Disable L2 flush during timing”).
```diff
-                    help="Disable L2 flush in benchmark mode.")
+                    help="Disable L2 flush during timing.")
```
```python
    flush_l2=bool(flush_l2),
    prep_fn=_prep_stage1,
)
torch.cuda.synchronize()
```
`bench_kernel_us` already synchronizes internally before returning. This extra `torch.cuda.synchronize()` is redundant and adds overhead; consider removing it (and the similar one after the stage2 timing call).
```diff
-torch.cuda.synchronize()
```
```python
for k in range(target_k, 0, -wmma_k):
    if inter_dim % k != 0:
        # K-padding: run_moe_stage2 auto-pads inter_dim, so accept
        # any tile_k that satisfies the load-mapping constraints.
        total = tile_m * k
        if total % 256 == 0 and (total // 256) % 4 == 0:
            if tile_k2 is None:
                tile_k2 = k
        continue
```
`bench_resolve_tiles` can now select a `tile_k2` even when `inter_dim % tile_k2 != 0` (the new non-divisible branch sets `tile_k2`). Most `run_moe_stage2` implementations in this repo (e.g. `tests/kernels/test_moe_gemm.py`) require `inter_dim` to be divisible by `tile_k2` and will error out; in `--bench` mode this turns what used to be a clean SKIP into per-token FAILs. Consider only choosing `tile_k2` values that divide `inter_dim` by default, and gate the non-divisible selection behind an explicit `allow_k_padding` flag / per-dtype capability for harnesses that actually pad K internally.
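The reviewer's suggested guard could look like the following (a hypothetical sketch, not the repo's actual `bench_resolve_tiles`; the function name, parameters, and the `total % 256` load-mapping constraint are taken from the diff above):

```python
def resolve_tile_k2(inter_dim, tile_m, target_k, wmma_k, allow_k_padding=False):
    """Pick a stage2 tile_k: prefer values that divide inter_dim, and only
    fall back to a K-padded (non-divisible) tile when the harness opts in
    via allow_k_padding. Illustrative sketch of the review suggestion."""
    padded_fallback = None
    for k in range(target_k, 0, -wmma_k):
        total = tile_m * k
        # Load-mapping constraint from the original selection loop.
        if total % 256 != 0 or (total // 256) % 4 != 0:
            continue
        if inter_dim % k == 0:
            return k  # divisible tile: safe for every run_moe_stage2
        if allow_k_padding and padded_fallback is None:
            padded_fallback = k  # remember the first padded candidate
    return padded_fallback
```

With the default `allow_k_padding=False`, a dimension with no divisible candidate yields `None`, which the harness can translate into a SKIP instead of a per-token FAIL.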
```python
    flush_l2=bool(flush_l2),
    prep_fn=_prep_stage1,
)
torch.cuda.synchronize()
```
`bench_kernel_us` already calls `torch.cuda.synchronize()` before returning the median latency. The extra `torch.cuda.synchronize()` here is redundant and adds overhead; removing it will make the harness slightly faster (same applies to the other `_bench_kernel_us` usage in this file).
```diff
-torch.cuda.synchronize()
```
```diff
 if arch.startswith("gfx10") or arch.startswith("gfx11"):
     return True
-if arch.startswith("gfx120"):
+if arch.startswith("gfx12"):
```
why rdna true for 1250?
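For reference, the broadened check in the diff above amounts to the following (an illustrative sketch; the actual `is_rdna_arch` in `python/flydsl/runtime/device.py` may carry additional cases):

```python
def is_rdna_arch(arch: str) -> bool:
    """Treat gfx10*, gfx11*, and all gfx12* targets as RDNA-family.

    The previous prefix "gfx120" matched gfx1200/gfx1201 but silently
    excluded gfx1250; widening to "gfx12" covers both.
    """
    return arch.startswith(("gfx10", "gfx11", "gfx12"))
```

The behavioral difference is only at the prefix boundary: `gfx1250` now returns `True`, while non-gfx12 targets such as `gfx942` are unaffected.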
- tdm_ops: add update_tensor_descriptor_2d_addr_lo /
update_tensor_gather_descriptor_addr_lo for in-place addr_lo advance,
and pin make_tensor_descriptor_2d's dgroup1 to an SSA value via
rocdl.readfirstlane(0) so LLVM cannot inline it as an <8 x i32>
literal at every tensor_load_to_lds call site (which made the AMDGPU
backend rematerialize 5-7 s_mov / s_brev per K-iter).
- moe_gemm_2stage_mxscale_gfx1250 (Stage1 + Stage2):
* Hoist make_desc_b / make_desc_b_pair / make_desc_bs / make_desc_bs_pair
out of the K loop into a _b_desc_cache; the hot path now only computes
the per-tile addr_lo and feeds it through update_tensor_descriptor_2d_addr_lo.
* Hoist the gather descriptor too: build dgroup1/2/3 once via
_build_a_gather_base_descs and advance addr_lo each iter via
update_tensor_gather_descriptor_addr_lo.
* Force _sum_i32_values's reduction onto an SGPR with rocdl.readfirstlane
so the gather _valid_count is not rematerialized as a VGPR sum chain
inside the scf.if guarding the gather issue (which had been triggering
a v_mov_b64 / v_readfirstlane round-trip of the whole 8-word header).
* Use range_constexpr instead of plain range when iterating over Python
lds_bufs lists inside kernel bodies, so the AST rewriter does not
promote the loop variable to an MLIR ArithValue.
- tests: harden run_moe_stage1 / run_moe_stage2 perf reporting against
  us <= 0 to avoid a div-by-zero in the kernel-name printout when a path
  is benchmarked with zero recorded time.
Effect on Stage1 ISA (model_dim=256, tile_k=128, smoke config):
ISA lines 1605 -> 1013 (-37%)
tensor_load_to_lds count 32 -> 8 (-75%)
per-iter dgroup1 rebuild SALU (s_brev / s_mov 0x40000 / 0x200000 / 0x4000 / 0x20000)
~40 -> 7 (the remainder are prologue + bg_pair<->bs_pair shape switch)
Both test_moe_stage1_use_tdm_gather_smoke and
test_moe_stage2_use_tdm_gather_smoke pass.
Made-with: Cursor
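The `us <= 0` hardening mentioned in the test bullet above can be sketched as a guard around the throughput computation (a hypothetical sketch; the function name and report format are illustrative, not the harness's actual printout):

```python
def format_perf_line(name, flops, us):
    """Format a kernel perf line, guarding the TFLOP/s division so a zero
    (or negative) recorded time cannot raise ZeroDivisionError."""
    if us <= 0:
        # No valid timing recorded for this path; report without throughput.
        return f"{name}: {us:.2f} us (no valid timing)"
    tflops = flops / (us * 1e-6) / 1e12  # us -> s, FLOP/s -> TFLOP/s
    return f"{name}: {us:.2f} us, {tflops:.2f} TFLOP/s"
```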