Summary
On RDNA4 / gfx1201 (Radeon AI PRO R9700), the HIP/ROCm backend runs Ternary-Bonsai-8B Q2_0 prefill ~13× slower than the Vulkan backend built from the same tree (prism @ c85e97a). Vulkan is excellent; HIP is not. Full diagnosis below: it reproduces on ROCm 7.2.1 and 7.2.4 (latest stable), and even a scalar Vulkan path (matrix cores disabled) beats ROCm's best by ~4×, which points past MMQ into the gfx1201 rocBLAS/Tensile stack.
Environment
GPU: 2× AMD Radeon AI PRO R9700 (RADV GFX1201), Vulkan 1.4.318
Diagnosis (ruled out the usual suspects)
Not a missing macro: amdclang++ --cuda-device-only --offload-arch=gfx1201 -dM -E defines __GFX12__ → RDNA4 + AMD_WMMA_AVAILABLE are ON.
WMMA is dispatched: rocprofv3 shows void mul_mat_q<...> = 59.8% of GPU time (not a cuBLAS fallback).
Not fallback selection: GGML_CUDA_FORCE_MMQ vs GGML_CUDA_FORCE_CUBLAS both ≈250 t/s.
Not clocks: during compute the GPU sits at ~1340–1452 MHz / 48–76 W (cap 2350 MHz / 350 W); rocm-smi --setperflevel high doesn't change it (261 vs 250 t/s). The kernel stalls → low occupancy → DPM won't boost.
Not a stale ROCm: reproduced on 7.2.4 (latest stable). rocBLAS is 5.2.0 in both 7.2.1 and 7.2.4, and the gfx1201 Tensile files are identical; notably, the int8 GEMM is TensileLibrary..._fallback_gfx1201.hsaco (a generic fallback kernel, not perf-tuned).
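To put the "not clocks" observation in proportion, here is a minimal Python sketch that expresses the reported clock and power ranges as fractions of their caps, plus the relative change from forcing the high performance level. All numbers are taken from the diagnosis above; the variable and function names are illustrative only.

```python
# Figures quoted in the diagnosis: observed range during compute vs. board cap.
CLOCK_MHZ = (1340, 1452)
CLOCK_CAP_MHZ = 2350
POWER_W = (48, 76)
POWER_CAP_W = 350

# Prefill throughput (t/s) with default DPM and with --setperflevel high.
TPS_DEFAULT = 250
TPS_PERF_HIGH = 261


def fraction_of_cap(observed, cap):
    """Return the (low, high) observed values as fractions of the cap."""
    low, high = observed
    return low / cap, high / cap


clock_lo, clock_hi = fraction_of_cap(CLOCK_MHZ, CLOCK_CAP_MHZ)
power_lo, power_hi = fraction_of_cap(POWER_W, POWER_CAP_W)
perf_gain = (TPS_PERF_HIGH - TPS_DEFAULT) / TPS_DEFAULT

print(f"clock: {clock_lo:.0%} to {clock_hi:.0%} of cap")
print(f"power: {power_lo:.0%} to {power_hi:.0%} of cap")
print(f"perflevel high gain: {perf_gain:.1%}")
```

By these numbers the card runs at roughly 57 to 62% of its clock cap but only about 14 to 22% of its power cap, and forcing the high performance level moves throughput by about 4%. That fits the stall explanation better than a clock-limit explanation.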
The key signal
Vulkan on radv advertises only VK_KHR_cooperative_matrix and int dot: 0, so it does dequant Q2_0 → fp16 → KHR cooperative-matrix. HIP uses the integer MMQ (quantize→int8→WMMA) path. On gfx1201, even Vulkan's scalar fp16 path (989 t/s) is ~4× faster than ROCm's best (250) — and ROCm's own FORCE_CUBLAS (rocBLAS fp16 GEMM) is also 250. So this isn't purely a mul_mat_q tuning issue; the entire gfx1201 ROCm GEMM path (hand-written WMMA MMQ + vendor rocBLAS) is immature on this new silicon.
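The "~4×" figure follows directly from the two throughput numbers quoted above. A small Python sketch of the arithmetic, using only those reported values (the names are illustrative):

```python
# Prefill throughput in tokens/s, as reported above.
VULKAN_SCALAR_TPS = 989.0  # Vulkan with matrix cores disabled (scalar fp16 path)
ROCM_BEST_TPS = 250.0      # HIP, same result with FORCE_MMQ or FORCE_CUBLAS


def speedup(fast_tps, slow_tps):
    """Ratio of two throughputs; values above 1 mean the first is faster."""
    return fast_tps / slow_tps


scalar_vs_rocm = speedup(VULKAN_SCALAR_TPS, ROCM_BEST_TPS)
print(f"scalar Vulkan vs best ROCm: {scalar_vs_rocm:.2f}x")
```

Since both ROCm paths land at the same ≈250 t/s, the same ratio applies whether the HIP side uses MMQ or rocBLAS.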
Question / ask
Is RDNA4 HIP/ROCm optimization for Q2_0 on the roadmap, or should Vulkan be the documented recommended backend for gfx1201? Happy to share the full rocprofv3 trace and test any patches. The scalar-Vulkan-beats-rocBLAS result suggests part of this belongs upstream in ROCm/rocBLAS (untuned gfx1201 Tensile); I can file there too if useful.
(Report prepared with tooling assistance; all numbers measured on the hardware above.)
Reproduction
PrismML-Eng/llama.cpp, prism branch @ c85e97a (PR vulkan: Q2_0 #32 merged)
Vulkan build: -DGGML_VULKAN=ON · HIP build: -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1201
Model: prism-ml/Ternary-Bonsai-8B-gguf : Ternary-Bonsai-8B-Q2_0.gguf
Benchmark: llama-bench -ngl 99 -p 512 -n 128
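For anyone rerunning the comparison, the sketch below assembles the same llama-bench invocation for each backend build. The flags `-ngl 99 -p 512 -n 128` and the model file name come from this report, and `-m` is llama-bench's model option. The build directory paths are assumptions and need adjusting to the local layout. The script only prints the commands; it does not run them.

```python
import shlex

# Benchmark flags quoted from the report.
BENCH_FLAGS = ["-ngl", "99", "-p", "512", "-n", "128"]

# Hypothetical build directories, one per backend; adjust to the local layout.
BUILDS = {
    "vulkan": "build-vulkan/bin/llama-bench",
    "hip": "build-hip/bin/llama-bench",
}

# Model file named in the report; the local path is an assumption.
MODEL = "Ternary-Bonsai-8B-Q2_0.gguf"


def bench_command(llama_bench, model):
    """Return the argv list for one llama-bench run on the given model file."""
    return [llama_bench, "-m", model, *BENCH_FLAGS]


for backend, binary in BUILDS.items():
    # shlex.join quotes arguments so the line can be pasted into a shell.
    print(f"{backend}: {shlex.join(bench_command(binary, MODEL))}")
```

Keeping the flag list in one place ensures both backends are measured with identical prompt and generation lengths.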