Unify MoE test perf timing to CUDA Event bench and fix minor issues#408
- Remove the `run_perftest` path from the mxscale/wmma test harnesses, keeping only `_bench_kernel_us` (CUDA Event per-iteration timing with L2 flush and IQR outlier filtering) for consistent and accurate benchmarking.
- Remove the `test_graph`/`benchmark_mode` parameters and the CUDA Graph parametrize from the pytest tests, since they were only used by the removed perftest path.
- Clean up the `__main__` CLI in the wmma test to remove the `--test_graph`/`--benchmark` args.
- Fix `is_rdna_arch()` to match `gfx12*` (not just `gfx120*`) so gfx1250 is detected.
- Fix the `bench_resolve_tiles()` `tile_k` fallback for non-divisible dimensions.

Made-with: Cursor
Pull request overview
This PR standardizes MoE 2-stage kernel timing in the gfx1250 test harnesses by removing the legacy run_perftest/CUDA-graph benchmarking path and using a single CUDA-event-based per-iteration timer, while also fixing a couple of small architecture/tiling utilities.
Changes:
- Remove the `run_perftest`/`test_graph`/`benchmark_mode` paths from the WMMA and MXScale MoE gfx1250 harnesses and always time with `bench_kernel_us`.
- Update tile resolution logic in `bench_resolve_tiles` and broaden RDNA arch detection from `gfx120*` to `gfx12*`.
- Simplify the WMMA harness CLI by removing the obsolete `--benchmark`/`--test_graph` options.
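The PR's single retained timing path combines CUDA-event per-iteration timing with IQR outlier filtering before taking the median. The filtering step can be sketched in pure Python as follows (a hypothetical sketch; the real `_bench_kernel_us` helper additionally drives CUDA events and the L2 flush, and its exact quantile method may differ):

```python
import statistics

def iqr_filter_median_us(samples_us):
    """Median latency after dropping 1.5*IQR outliers, as IQR-filtered
    benchmark timers typically do. Illustrative sketch only."""
    if len(samples_us) < 4:
        # Too few samples to estimate quartiles reliably.
        return statistics.median(samples_us)
    s = sorted(samples_us)
    q1, _, q3 = statistics.quantiles(s, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Keep in-fence samples; fall back to all samples if everything is cut.
    kept = [x for x in s if lo <= x <= hi] or s
    return statistics.median(kept)
```

A one-off warmup spike (e.g. a first-iteration 100 µs among 10 µs steady-state samples) is discarded before the median is taken, which is why per-iteration timing plus filtering is more stable than a single wall-clock average.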
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tests/kernels/test_moe_gemm_wmma_gfx1250.py | Removes graph/perftest timing paths and CLI args; uses CUDA-event benchmark timer consistently. |
| tests/kernels/test_moe_gemm_mxscale_gfx1250.py | Removes graph/perftest timing paths; uses CUDA-event benchmark timer consistently. |
| tests/kernels/benchmark_common.py | Adjusts tile selection behavior for stage2 tile_k resolution. |
| python/flydsl/runtime/device.py | Updates RDNA arch detection to treat all gfx12* as RDNA. |
```python
parser.add_argument("--benchmark", action="store_true", default=False,
                    help="Use GEMM-style per-iteration event timing with optional L2 flush.")
parser.add_argument("--no_flush_l2", action="store_true", default=False,
                    help="Disable L2 flush in benchmark mode.")
```
The --benchmark flag was removed, but the --no_flush_l2 help text still refers to “benchmark mode”. Since timing is now always done via the CUDA-event benchmark path, consider updating the help text to reflect the current behavior (e.g., “Disable L2 flush during timing”).
```diff
-                    help="Disable L2 flush in benchmark mode.")
+                    help="Disable L2 flush during timing.")
```
```python
    flush_l2=bool(flush_l2),
    prep_fn=_prep_stage1,
)
torch.cuda.synchronize()
```
`bench_kernel_us` already synchronizes internally before returning. This extra `torch.cuda.synchronize()` is redundant and adds overhead; consider removing it (and the similar one after the stage2 timing call).
```diff
-torch.cuda.synchronize()
```
```python
for k in range(target_k, 0, -wmma_k):
    if inter_dim % k != 0:
        # K-padding: run_moe_stage2 auto-pads inter_dim, so accept
        # any tile_k that satisfies the load-mapping constraints.
        total = tile_m * k
        if total % 256 == 0 and (total // 256) % 4 == 0:
            if tile_k2 is None:
                tile_k2 = k
        continue
```
`bench_resolve_tiles` can now select a `tile_k2` even when `inter_dim % tile_k2 != 0` (the new non-divisible branch sets `tile_k2`). Most `run_moe_stage2` implementations in this repo (e.g. `tests/kernels/test_moe_gemm.py`) require `inter_dim` to be divisible by `tile_k2` and will error out; in `--bench` mode this turns what used to be a clean SKIP into per-token FAILs. Consider only choosing `tile_k2` values that divide `inter_dim` by default, and gate the non-divisible selection behind an explicit `allow_k_padding` flag / per-dtype capability for harnesses that actually pad K internally.
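The reviewer's suggested guard could look like the following (a hypothetical sketch, not the repo's actual `bench_resolve_tiles`; the function name, parameters, and the `total % 256` load-mapping constraint are taken from the diff above):

```python
def resolve_tile_k2(inter_dim, tile_m, target_k, wmma_k, allow_k_padding=False):
    """Pick a stage2 tile_k: prefer values that divide inter_dim, and only
    fall back to a K-padded (non-divisible) tile when the harness opts in
    via allow_k_padding. Illustrative sketch of the review suggestion."""
    padded_fallback = None
    for k in range(target_k, 0, -wmma_k):
        total = tile_m * k
        # Load-mapping constraint from the original selection loop.
        if total % 256 != 0 or (total // 256) % 4 != 0:
            continue
        if inter_dim % k == 0:
            return k  # divisible tile: safe for every run_moe_stage2
        if allow_k_padding and padded_fallback is None:
            padded_fallback = k  # remember the first padded candidate
    return padded_fallback
```

With the default `allow_k_padding=False`, a dimension with no divisible candidate yields `None`, which the harness can translate into a SKIP instead of a per-token FAIL.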
```python
    flush_l2=bool(flush_l2),
    prep_fn=_prep_stage1,
)
torch.cuda.synchronize()
```
`bench_kernel_us` already calls `torch.cuda.synchronize()` before returning the median latency. The extra `torch.cuda.synchronize()` here is redundant and adds overhead; removing it will make the harness slightly faster (same applies to the other `_bench_kernel_us` usage in this file).
```diff
-torch.cuda.synchronize()
```
```diff
 if arch.startswith("gfx10") or arch.startswith("gfx11"):
     return True
-if arch.startswith("gfx120"):
+if arch.startswith("gfx12"):
```
why rdna true for 1250?
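For reference, the broadened check in the diff above amounts to the following (an illustrative sketch; the actual `is_rdna_arch` in `python/flydsl/runtime/device.py` may carry additional cases):

```python
def is_rdna_arch(arch: str) -> bool:
    """Treat gfx10*, gfx11*, and all gfx12* targets as RDNA-family.

    The previous prefix "gfx120" matched gfx1200/gfx1201 but silently
    excluded gfx1250; widening to "gfx12" covers both.
    """
    return arch.startswith(("gfx10", "gfx11", "gfx12"))
```

The behavioral difference is only at the prefix boundary: `gfx1250` now returns `True`, while non-gfx12 targets such as `gfx942` are unaffected.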
- tdm_ops: add update_tensor_descriptor_2d_addr_lo /
update_tensor_gather_descriptor_addr_lo for in-place addr_lo advance,
and pin make_tensor_descriptor_2d's dgroup1 to an SSA value via
rocdl.readfirstlane(0) so LLVM cannot inline it as an <8 x i32>
literal at every tensor_load_to_lds call site (which made the AMDGPU
backend rematerialize 5-7 s_mov / s_brev per K-iter).
- moe_gemm_2stage_mxscale_gfx1250 (Stage1 + Stage2):
* Hoist make_desc_b / make_desc_b_pair / make_desc_bs / make_desc_bs_pair
out of the K loop into a _b_desc_cache; the hot path now only computes
the per-tile addr_lo and feeds it through update_tensor_descriptor_2d_addr_lo.
* Hoist the gather descriptor too: build dgroup1/2/3 once via
_build_a_gather_base_descs and advance addr_lo each iter via
update_tensor_gather_descriptor_addr_lo.
* Force _sum_i32_values's reduction onto an SGPR with rocdl.readfirstlane
so the gather _valid_count is not rematerialized as a VGPR sum chain
inside the scf.if guarding the gather issue (which had been triggering
a v_mov_b64 / v_readfirstlane round-trip of the whole 8-word header).
* Use range_constexpr instead of plain range when iterating over Python
lds_bufs lists inside kernel bodies, so the AST rewriter does not
promote the loop variable to an MLIR ArithValue.
- tests: harden run_moe_stage1 / run_moe_stage2 perf reporting against
  us <= 0 to avoid a div-by-zero in the kernel-name printout when a path
  is benchmarked with zero recorded time.
Effect on Stage1 ISA (model_dim=256, tile_k=128, smoke config):
ISA lines 1605 -> 1013 (-37%)
tensor_load_to_lds count 32 -> 8 (-75%)
per-iter dgroup1 rebuild SALU (s_brev / s_mov 0x40000 / 0x200000 / 0x4000 / 0x20000)
~40 -> 7 (the remainder are prologue + bg_pair<->bs_pair shape switch)
Both test_moe_stage1_use_tdm_gather_smoke and
test_moe_stage2_use_tdm_gather_smoke pass.
Made-with: Cursor
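The `us <= 0` hardening mentioned in the test bullet above can be sketched as a guard around the throughput computation (a hypothetical sketch; the function name and report format are illustrative, not the harness's actual printout):

```python
def format_perf_line(name, flops, us):
    """Format a kernel perf line, guarding the TFLOP/s division so a zero
    (or negative) recorded time cannot raise ZeroDivisionError."""
    if us <= 0:
        # No valid timing recorded for this path; report without throughput.
        return f"{name}: {us:.2f} us (no valid timing)"
    tflops = flops / (us * 1e-6) / 1e12  # us -> s, FLOP/s -> TFLOP/s
    return f"{name}: {us:.2f} us, {tflops:.2f} TFLOP/s"
```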