RDNA4/gfx1201: HIP/ROCm Q2_0 ~11× slower than Vulkan (reproduced on ROCm 7.2.1 & 7.2.4)

## Summary

On **RDNA4 / gfx1201** (Radeon AI PRO R9700), the **HIP/ROCm** backend runs `Ternary-Bonsai-8B` **Q2_0** prefill (pp512) **~11× slower than the Vulkan backend** built from the same tree (`prism` @ `c85e97a`): about 250 t/s versus 2775 t/s. Token generation is much closer (89–99.5 vs 132 t/s). The gap reproduces on **ROCm 7.2.1 and 7.2.4 (latest stable)**. Even a **scalar Vulkan** path (cooperative matrix disabled) beats ROCm's best prefill by ~4×, which suggests the problem extends past MMQ into the gfx1201 rocBLAS/Tensile stack. Full diagnosis below.

## Environment
- GPU: 2× AMD Radeon AI PRO R9700 (**RADV GFX1201**), Vulkan 1.4.318
- llama.cpp: `PrismML-Eng/llama.cpp` `prism` branch @ `c85e97a` (PR #32 merged)
- Vulkan build: `-DGGML_VULKAN=ON` · HIP build: `-DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1201`
- ROCm **7.2.1** and **7.2.4** (both tested); rocBLAS **5.2.0** in both
- Model: `prism-ml/Ternary-Bonsai-8B-gguf:Ternary-Bonsai-8B-Q2_0.gguf`

## Benchmark (`llama-bench -ngl 99 -p 512 -n 128`)

| Backend / ROCm | pp512 (t/s) | tg128 (t/s) |
|---|---|---|
| **Vulkan** (KHR cooperative-matrix) | **2775** | **132** |
| Vulkan, all coopmat **disabled** (scalar fp16) | 989 | — |
| ROCm 7.2.1 (HIP) | 250 | 99.5 |
| **ROCm 7.2.4** (HIP, latest stable) | 249 | 89 |
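
The ratios implied by the table can be checked directly from its rows. This is a minimal sketch; the `ratio` helper is a hypothetical name introduced here for illustration and is not part of llama.cpp or `llama-bench`.

```bash
# ratio NUM DEN: print NUM/DEN rounded to one decimal place.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

ratio 2775 250    # Vulkan coopmat vs ROCm 7.2.1, pp512  -> 11.1
ratio 989 250     # Vulkan scalar  vs ROCm 7.2.1, pp512  -> 4.0
ratio 132 99.5    # Vulkan         vs ROCm 7.2.1, tg128  -> 1.3
```

The prefill gap is roughly an order of magnitude, while the token-generation gap is about 1.3×, so the regression is concentrated in the batched GEMM path rather than in single-token decoding.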

## Diagnosis (ruled out the usual suspects)
- **Not a missing macro:** `amdclang++ --cuda-device-only --offload-arch=gfx1201 -dM -E` defines `__GFX12__` → `RDNA4` + `AMD_WMMA_AVAILABLE` are ON.
- **WMMA is dispatched:** `rocprofv3` shows `void mul_mat_q<...>` = **59.8%** of GPU time (not a cuBLAS fallback).
- **Not fallback selection:** `GGML_CUDA_FORCE_MMQ` and `GGML_CUDA_FORCE_CUBLAS` both give ≈250 t/s pp512.
- **Not clocks:** during compute the GPU sits at ~1340–1452 MHz / 48–76 W (cap 2350 MHz / 350 W), and `rocm-smi --setperflevel high` barely changes throughput (261 vs 250 t/s pp512). The low clocks look like a symptom rather than a cause: the kernel likely stalls → low occupancy → DPM won't boost.
- **Not a stale ROCm:** reproduced on 7.2.4 (latest stable). rocBLAS is **5.2.0** in both 7.2.1 and 7.2.4, and the gfx1201 Tensile files are identical — notably the int8 GEMM is `TensileLibrary..._fallback_gfx1201.hsaco` (a generic fallback kernel, not perf-tuned).
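
The macro check in the first bullet can be scripted so it is easy to repeat on another machine. The sketch below is an assumption-laden illustration: `has_macro` is a hypothetical helper, and the `-x hip /dev/null` input in the usage comment is one plausible way to feed the compiler an empty translation unit, since the original command line does not show its input.

```bash
# has_macro NAME: read a `-dM -E` preprocessor dump on stdin and
# succeed only if NAME is defined in it (exact name match).
has_macro() {
    grep -Eq "^#define[[:space:]]+$1([[:space:]]|\$)"
}

# Usage on a machine with ROCm installed (assumed invocation, not run here):
#   amdclang++ -x hip --cuda-device-only --offload-arch=gfx1201 -dM -E /dev/null \
#     | has_macro __GFX12__ && echo "__GFX12__ defined"

# Self-check against small made-up dumps, so the helper can be tried without ROCm.
printf '#define __GFX12__ 1\n#define __HIP__ 1\n' | has_macro __GFX12__ && echo "found"
printf '#define __GFX11__ 1\n' | has_macro __GFX12__ || echo "missing"
```

Matching the full macro name avoids false positives from longer identifiers that merely start with `__GFX12__`.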

## The key signal
Vulkan on radv advertises only `VK_KHR_cooperative_matrix` and `int dot: 0`, so it does **dequant Q2_0 → fp16 → KHR cooperative-matrix**. HIP uses the **integer MMQ** (quantize→int8→WMMA) path. On gfx1201, **even Vulkan's scalar fp16 path (989 t/s) is ~4× faster than ROCm's best (250 t/s)**, and ROCm's own `FORCE_CUBLAS` path (rocBLAS fp16 GEMM) also lands at 250 t/s. Since two independent ROCm GEMM implementations hit the same ceiling, this does not look like purely a `mul_mat_q` tuning issue; the **entire gfx1201 ROCm GEMM path (hand-written WMMA MMQ + vendor rocBLAS) appears immature** on this new silicon.

## Reproduction
```bash
git clone -b prism https://github.com/PrismML-Eng/llama.cpp && cd llama.cpp
# Vulkan
cmake -B build -DGGML_VULKAN=ON && cmake --build build -j --target llama-bench
# HIP (gfx1201)
cmake -B build-hip -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1201 && cmake --build build-hip -j --target llama-bench
M=Ternary-Bonsai-8B-Q2_0.gguf   # local path to the model file listed under Environment
build/bin/llama-bench     -m "$M" -ngl 99 -p 512 -n 128   # Vulkan  ~2775 / 132
build-hip/bin/llama-bench -m "$M" -ngl 99 -p 512 -n 128   # ROCm    ~250  / 99
```

## Question / ask
Is RDNA4 HIP/ROCm optimization for Q2_0 on the roadmap, or should **Vulkan be the documented recommended backend for gfx1201**? Happy to share the full `rocprofv3` trace and test any patches. The scalar-Vulkan-beats-rocBLAS result suggests part of this belongs upstream in ROCm/rocBLAS (untuned gfx1201 Tensile) — I can file there too if useful.

*(Report prepared with tooling assistance; all numbers measured on the hardware above.)*

