RDNA4/gfx1201: HIP/ROCm Q2_0 ~11× slower than Vulkan (reproduced on ROCm 7.2.1 & 7.2.4) #49

@The-Monk

Summary

On RDNA4 / gfx1201 (Radeon AI PRO R9700), the HIP/ROCm backend runs Ternary-Bonsai-8B Q2_0 prefill ~11× slower than the Vulkan backend built from the same tree (prism @ c85e97a): about 250 vs 2775 t/s at pp512. Vulkan is excellent; HIP is not. Full diagnosis below. The gap reproduces on ROCm 7.2.1 and 7.2.4 (latest stable), and even a scalar Vulkan path (matrix cores disabled) beats ROCm's best by ~4×, which points past MMQ into the gfx1201 rocBLAS/Tensile stack.

Environment

  • GPU: 2× AMD Radeon AI PRO R9700 (RADV GFX1201), Vulkan 1.4.318
  • llama.cpp: PrismML-Eng/llama.cpp prism branch @ c85e97a (PR vulkan: Q2_0 #32 merged)
  • Vulkan build: -DGGML_VULKAN=ON · HIP build: -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1201
  • ROCm 7.2.1 and 7.2.4 (both tested); rocBLAS 5.2.0 in both
  • Model: prism-ml/Ternary-Bonsai-8B-gguf:Ternary-Bonsai-8B-Q2_0.gguf

Benchmark (llama-bench -ngl 99 -p 512 -n 128)

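| Backend / ROCm                             | pp512 (t/s) | tg128 (t/s)  |
| ------------------------------------------ | ----------- | ------------ |
| Vulkan (KHR cooperative-matrix)            | 2775        | 132          |
| Vulkan, all coopmat disabled (scalar fp16) | 989         | not reported |
| ROCm 7.2.1 (HIP)                           | 250         | 99.5         |
| ROCm 7.2.4 (HIP, latest stable)            | 249         | 89           |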
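
For reference, the ratios implied by the table can be recomputed directly from its rows. This is only a small arithmetic sketch; `ratio` is a hypothetical helper name, and the inputs are the pp512 and tg128 figures reported above.

```shell
# Hypothetical helper: print a/b rounded to one decimal place.
ratio() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

ratio 2775 250    # Vulkan coopmat vs ROCm 7.2.1, pp512  -> 11.1
ratio 989 250     # Vulkan scalar fp16 vs ROCm 7.2.1, pp512 -> 4.0
ratio 2775 989    # Vulkan coopmat vs Vulkan scalar, pp512 -> 2.8
ratio 132 99.5    # Vulkan vs ROCm 7.2.1, tg128 -> 1.3
```

The token-generation gap (about 1.3×) is much smaller than the prefill gap, which is consistent with the slowdown being concentrated in the batched GEMM path.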

Diagnosis (ruled out the usual suspects)

  • Not a missing macro: amdclang++ --cuda-device-only --offload-arch=gfx1201 -dM -E shows __GFX12__ is defined, so RDNA4 and AMD_WMMA_AVAILABLE are both ON.
  • WMMA is dispatched: rocprofv3 shows void mul_mat_q<...> = 59.8% of GPU time (not a cuBLAS fallback).
  • Not fallback selection: GGML_CUDA_FORCE_MMQ vs GGML_CUDA_FORCE_CUBLAS both ≈250.
  • Not clocks: during compute the GPU sits at ~1340–1452 MHz / 48–76 W (cap 2350 MHz / 350 W); rocm-smi --setperflevel high barely changes throughput (261 vs 250 t/s). The kernel stalls → low occupancy → DPM won't boost.
  • Not a stale ROCm: reproduced on 7.2.4 (latest stable). rocBLAS is 5.2.0 in both 7.2.1 and 7.2.4, and the gfx1201 Tensile files are identical — notably the int8 GEMM is TensileLibrary..._fallback_gfx1201.hsaco (a generic fallback kernel, not perf-tuned).
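
To put the clock and power readings from the "Not clocks" bullet in proportion, the sketch below expresses them as a percentage of the stated caps. `pct` is a hypothetical helper name; the inputs are the figures quoted in that bullet.

```shell
# Hypothetical helper: print value as a whole-number percentage of cap.
pct() {
  awk -v v="$1" -v cap="$2" 'BEGIN { printf "%.0f\n", 100 * v / cap }'
}

pct 1340 2350   # lowest observed clock vs 2350 MHz cap  -> 57
pct 1452 2350   # highest observed clock vs 2350 MHz cap -> 62
pct 48 350      # lowest observed power vs 350 W cap     -> 14
pct 76 350      # highest observed power vs 350 W cap    -> 22
```

Power draw at roughly 14–22% of the cap during a prefill benchmark fits the description of a kernel that is stalled rather than one that is compute-bound.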

The key signal

Vulkan on radv advertises only VK_KHR_cooperative_matrix and int dot: 0, so it dequantizes Q2_0 → fp16 and multiplies via KHR cooperative-matrix. HIP uses the integer MMQ (quantize→int8→WMMA) path. On gfx1201, even Vulkan's scalar fp16 path (989 t/s) is ~4× faster than ROCm's best (250 t/s), and ROCm's own FORCE_CUBLAS (rocBLAS fp16 GEMM) also lands at 250 t/s. So this isn't purely a mul_mat_q tuning issue; the entire gfx1201 ROCm GEMM path (hand-written WMMA MMQ + vendor rocBLAS) is immature on this new silicon.

Reproduction

git clone -b prism https://github.com/PrismML-Eng/llama.cpp && cd llama.cpp
# Vulkan
cmake -B build -DGGML_VULKAN=ON && cmake --build build -j --target llama-bench
# HIP (gfx1201)
cmake -B build-hip -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1201 && cmake --build build-hip -j --target llama-bench
M=Ternary-Bonsai-8B-Q2_0.gguf
build/bin/llama-bench     -m $M -ngl 99 -p 512 -n 128   # Vulkan  ~2775 / 132
build-hip/bin/llama-bench -m $M -ngl 99 -p 512 -n 128   # ROCm    ~250  / 99
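
For comparing the two builds side by side, the t/s figures can be pulled out of llama-bench's default markdown table. The following is a minimal sketch, not part of the original report: `extract_ts` is a hypothetical helper, and it assumes the default table layout in which the last two columns are `test` and `t/s` (with t/s printed as "mean ± stddev").

```shell
# Hypothetical helper: read llama-bench's default markdown table on stdin
# and print "<test> <mean t/s>" for each pp*/tg* row.
extract_ts() {
  awk -F'|' 'NF >= 4 {
    split($(NF-2), t, " ")   # test column, e.g. pp512 or tg128
    split($(NF-1), v, " ")   # t/s column; first token is the mean
    if (t[1] ~ /^(pp|tg)[0-9]+$/) print t[1], v[1]
  }'
}

# Usage (assumes the builds and $M from the commands above):
#   build/bin/llama-bench     -m "$M" -ngl 99 -p 512 -n 128 | extract_ts
#   build-hip/bin/llama-bench -m "$M" -ngl 99 -p 512 -n 128 | extract_ts
```

Header and separator rows are skipped because their `test` column does not match the `pp<N>` / `tg<N>` pattern.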

Question / ask

Is RDNA4 HIP/ROCm optimization for Q2_0 on the roadmap, or should Vulkan be the documented recommended backend for gfx1201? Happy to share the full rocprofv3 trace and test any patches. The scalar-Vulkan-beats-rocBLAS result suggests part of this belongs upstream in ROCm/rocBLAS (untuned gfx1201 Tensile) — I can file there too if useful.

(Report prepared with tooling assistance; all numbers measured on the hardware above.)
