Closed
…x1250

Introduces the FlyDSL A16W16 GEMM kernel for RDNA4 (gfx1250) and integrates it as a first-class tunable backend in GemmTuner, alongside the existing splitk_hgemm and ASM paths.

New files:
- aiter/ops/flydsl/kernels/gemm_a16w16_gfx1250.py: WMMA 16x16x32 kernel using RDNA4 wave32; handles K-padding and N-stride internally; supports fp16/bf16 input, configurable tiling (tile_m/n/k), warp layout (m/n_warp), double-buffering (num_buffers), waves_per_eu, and L2 prefetch distance

Changes to existing files:
- aiter/ops/flydsl/gemm_kernels.py: add get_flydsl_a16w16_gfx1250_kernels() catalog and get_flydsl_a16w16_gfx1250_kernel_params() lookup; kernel name encodes all config parameters for reversible CSV serialisation
- gradlib/gradlib/GemmTuner.py: import the new kernel; add run_flydsl_gemm_a16w16() run function; add flydsl_a16w16_gemm_all_sols() enumerator; route gfx1250 through the a16w16 path in run_asm_triton_sols() while other architectures continue using the existing splitk_hgemm path; also restores the ASM SplitK semaphore guard (gdx*gdy <= 1024) that was missing on main (also tracked in PR #2721)
- aiter/tuned_gemm.py: add flydsl_a16w16_gemm() dispatch function; update the flydsl config lookup to resolve a16w16 kernel names, falling back to splitk_hgemm; select the correct call site based on the resolved config
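The description states that the kernel name encodes every config parameter so that a tuned CSV row can be mapped back to a config. A minimal sketch of such a round trip follows. The name layout, the `flydsl_a16w16_gfx1250` prefix, the field tags, and the `KernelConfig` dataclass are all hypothetical and chosen only for illustration; the PR's actual naming scheme may differ.

```python
import re
from dataclasses import dataclass
from typing import Optional


# Hypothetical config record. The fields mirror the knobs listed in the PR
# description, but this is not the catalog structure used by the PR.
@dataclass(frozen=True)
class KernelConfig:
    tile_m: int
    tile_n: int
    tile_k: int
    m_warp: int
    n_warp: int
    num_buffers: int
    waves_per_eu: int
    l2_prefetch: int


# Assumed prefix and field layout, for illustration only.
_PREFIX = "flydsl_a16w16_gfx1250"
_PATTERN = re.compile(
    rf"^{_PREFIX}_t(\d+)x(\d+)x(\d+)_w(\d+)x(\d+)_b(\d+)_e(\d+)_p(\d+)$"
)


def encode_name(cfg: KernelConfig) -> str:
    """Serialise every config field into the kernel name."""
    return (
        f"{_PREFIX}_t{cfg.tile_m}x{cfg.tile_n}x{cfg.tile_k}"
        f"_w{cfg.m_warp}x{cfg.n_warp}"
        f"_b{cfg.num_buffers}_e{cfg.waves_per_eu}_p{cfg.l2_prefetch}"
    )


def decode_name(name: str) -> Optional[KernelConfig]:
    """Recover the config from a name, or None if the name is not ours."""
    match = _PATTERN.match(name)
    if match is None:
        return None
    return KernelConfig(*(int(group) for group in match.groups()))
```

Returning `None` for a name that does not match lets a caller try another backend's lookup next, which is the kind of fallback to splitk_hgemm the description mentions for aiter/tuned_gemm.py.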
@@ -0,0 +1,857 @@
+# SPDX-License-Identifier: MIT
+# Copyright (C) 2024-2026, Advanced Micro Devices, Inc. All rights reserved.
+s
Contributor
+# SPDX-License-Identifier: MIT
+# Copyright (C) 2024-2026, Advanced Micro Devices, Inc. All rights reserved.
+s
+import torch
Contributor
+# Copyright (C) 2024-2026, Advanced Micro Devices, Inc. All rights reserved.
+s
+import torch
+import flydsl.compiler as flyc
Contributor
+s
+import torch
+import flydsl.compiler as flyc
+import flydsl.expr as fx
Contributor
+import torch
+import flydsl.compiler as flyc
+import flydsl.expr as fx
+from flydsl._mlir import ir
Contributor
+from flydsl.compiler.kernel_function import CompilationContext
+from flydsl.expr import arith, buffer_ops, gpu, range_constexpr, rocdl, tdm_ops, vector
+from flydsl.expr.arith import _to_raw as _raw
+from flydsl.expr.typing import T
Contributor
+from flydsl.expr import arith, buffer_ops, gpu, range_constexpr, rocdl, tdm_ops, vector
+from flydsl.expr.arith import _to_raw as _raw
+from flydsl.expr.typing import T
+from flydsl.runtime.device import get_rocm_arch as get_hip_arch
Contributor
+from flydsl.expr.arith import _to_raw as _raw
+from flydsl.expr.typing import T
+from flydsl.runtime.device import get_rocm_arch as get_hip_arch
+from flydsl.utils.smem_allocator import SmemAllocator, SmemPtr, get_op_result_or_value
Contributor
+from flydsl.expr.typing import T
+from flydsl.runtime.device import get_rocm_arch as get_hip_arch
+from flydsl.utils.smem_allocator import SmemAllocator, SmemPtr, get_op_result_or_value
+from flydsl.expr import idx2crd
Contributor
+from flydsl.runtime.device import get_rocm_arch as get_hip_arch
+from flydsl.utils.smem_allocator import SmemAllocator, SmemPtr, get_op_result_or_value
+from flydsl.expr import idx2crd
+from typing import Optional
Contributor
…le candidates, warn on unresolved kernel names

- flydsl_gemm() now passes stages/async_copy/c_to_lds from the stored catalog config to flydsl_hgemm(), matching what was benchmarked at tune time
- flydsl_gemm_all_sols() skips tile_m configs larger than max(M, 16), reducing the candidate search space for small-M shapes
- get_GEMM_A16W16_config() emits a warning when a stored FlyDSL kernel name cannot be resolved against the current catalog, instead of silently falling back to torch
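The pruning rule and the warning behaviour described in this commit can be sketched as below. This is an illustration under stated assumptions, not the PR's code: the catalog is modelled as a plain dict from kernel name to a params dict, and `prune_by_tile_m` and `resolve_kernel` are hypothetical helper names.

```python
import warnings
from typing import Dict, List, Optional

# Hypothetical catalog shape: kernel name -> parameter dict.
Catalog = Dict[str, Dict[str, int]]


def prune_by_tile_m(catalog: Catalog, M: int) -> List[str]:
    """Keep only candidates whose tile_m does not exceed max(M, 16)."""
    limit = max(M, 16)
    return [name for name, params in catalog.items() if params["tile_m"] <= limit]


def resolve_kernel(name: str, catalog: Catalog) -> Optional[Dict[str, int]]:
    """Look up a stored kernel name; warn instead of failing silently."""
    params = catalog.get(name)
    if params is None:
        warnings.warn(
            f"FlyDSL kernel {name!r} not found in the current catalog; "
            "falling back to torch",
            RuntimeWarning,
        )
    return params
```

The floor of 16 in `max(M, 16)` keeps the smallest tile available even when M is below 16, so small-M shapes still have at least one candidate to tune.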
Contributor
Author
This conflicts with another recent PR, changing strategy