Catch up to upstream/master (preserve all TurboQuant+ / MTP / contributor work) by TheTom · Pull Request #182 · TheTom/llama-cpp-turboquant

TheTom · 2026-06-14T17:34:21Z

Full catch-up merge of upstream/master into feature/turboquant-kv-cache. Merge (not rebase), so all fork commits, hashes, and authors are retained.

Verified on M5 Max (Metal)

Build green (full build, llama-cli + llama-quantize).
Turbo KV A/B vs f16 (gemma-4-12B-it-Q8_0): turbo4/turbo3 track f16, turbo2 coherent. Attn-rotation default-off policy holds.

Preserved (verified)

CUDA TurboQuant (@signalnine / Gabe Ortiz): turbo2/3/4, TQ4_1S/TQ3_1S, WHT, sparse-V, InnerQ, MLA, D=640 MMA FA
HIP: variadic shfl, VEC-force, #176 ("HIP: fix turbo KV decode crash under graph capture; batch-aware VEC/TILE FA routing") graph-capture decode fix
Metal: TQ3-rotated mul_mm + turbo_wht
KV cache: auto-asymmetric turbo-K upgrade, default-off per-side attn-rotation, turbo head padding, n_layer_kv, +3 overhead
MTP: gemma4-assistant masked-embd, draft-MTP server multimodal, ctx_other
Server: get_slot_by_cache_key (unioned with upstream get_slot_by_cmpl_id)
Vulkan turbo3 KV-cache pipelines

Key resolutions

See TURBOQUANT_UPSTREAM_MERGE.md. Highlights: kv-cache ctor union (upstream shared-cells + fork auto-asymmetric + n_layer_kv); attn-rotation keeps fork policy + grafts upstream's DeepSeek-V3.2 indexer force; fattn-common.cuh takes upstream's graph-safe f16_extra (supersedes HIP pool workaround).

Build fixes (auto-merge dual-additions / API drift)

GGML_OP_COUNT 97->98 (TURBO_WHT); duplicate n_outputs_max; missing n_embd decl; duplicate get_suppress_tokens / PROJECTOR_TYPE_GEMMA4UA / clip_graph_gemma4uv; server get_available_slot reconciled.

TODO before relying on this (per-backend, untestable on M5)

AMD RDNA4: build Vulkan, reconcile shader-gen, re-port turbo3 Vulkan FA (deferred — upstream's FA stack evolved past the fork's; recovery commits in the doc), smoke-test turbo KV
5090 / CUDA: build CUDA, run turbo KV correctness + perf tests
M3 GGUF Config-I: with this catch-up + upstream MiniMax-M3 support (ggml-org/llama.cpp#24523, "Preliminary MiniMax-M3 support"), quantize M3 to Config-I in GGUF

⚠️ Vulkan turbo3 flash-attention fast-path is deferred, not lost: the merge took upstream's FA, and the re-port is tracked for the AMD box. The turbo3 Vulkan KV cache itself is preserved.


…se Attention (DSA) implementation (ggml-org#23346)

* llama : support DeepSeek V3.2 model family (with DSA lightning indexer)
* convert : handle DeepseekV32ForCausalLM architecture
* ggml : support for f16 GGML_OP_FILL
* memory : separate hparams argument in llama_kv_cache constructor
* memory : add llama_kv_cache_dsa memory (KV cache + lightning indexer cache)
* llama : support for LLM_ARCH_DEEPSEEK32
* model : llama_model_deepseek32 implementation
* model : merge two scale operations into one in DSA lightning indexer implementation
* chore : remove unused code
* model : support NVFP4 in DeepSeek V3.2

  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* memory : refactoring TODO

  Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>


…23530)

* CUDA: Check PTX version on host side to guard PDL dispatch

  Checking on `__CUDA_ARCH_LIST__` alone is insufficient for JIT, as this variable doesn't differentiate between compiling for say sm_90, sm_90a or sm_90f (so forward-jittable PTX vs. arch/family-specific PTX). Thus, one can have a bug when compiling with `DCMAKE_CUDA_ARCHITECTURES="89;90a"`, where current code would wrongly dispatch to PDL on sm_90/sm_120 in forward-JIT mode.

  This PR fixes this issue by checking `cudaFuncAttributes::ptxVersion` of the incoming kernel at runtime. A check on ptxVersion alone is sufficient, as device-codes will always be >= ptxVersion (and any violation of this would be a severe bug in CUDA/nvcc), see: https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/#gpu-code-code-code

* Implement MurmurHash3 mixer for better hash distribution

  Magic constants were taken from boost: https://github.com/boostorg/container_hash/blob/2698b43803c012601e6bb1a6116e83767b97986c/include/boost/container_hash/detail/hash_mix.hpp#L19-L65

* Update ggml/src/ggml-cuda/common.cuh

  Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Address review comments, make seed non-zero
* Apply code-formatting
* Replace std::size_t -> size_t for consistency

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* mtmd: DeepSeek-OCR 2 support, with multi-tile dynamic resolution
* introduced clip_image_f32::add_viewsep
* address PR review
  - drop redundant ggml_cpy ops in both deepseekocr versions build
  - drop no-op ggml_cont in build_sam
  - assert num_image_tokens deepseekocr2
  - view_seperator as (1, n_embd) at conversion (for both versions)
  - drop redundant ggml_reshape_2d
* Update tools/mtmd/models/deepseekocr2.cpp

  Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>


TheTom · 2026-06-16T22:22:37Z

Verification: M5 Max (Metal)

EAGLE3 (ggml-org#18039) status

No cherry-pick required. The upstream EAGLE3 squash merge (88a39274e) landed before this branch's catch-up merge (d75ef87ae), so the feature is already present in this PR. Confirmed intact, not just in history:

src/models/eagle3.cpp and common/speculative.cpp are byte-identical to upstream's merged version (0-line diff against 88a39274e).
eagle3 arch registered in src/llama-arch.cpp (LLM_ARCH_EAGLE3).
The remaining diffs in llama-context.cpp / llama-graph.cpp vs the upstream merge are TurboQuant + later-upstream changes, not EAGLE3 regressions.

Build

Full build green, AppleClang 21, build b9638-d75ef87ae:

llama-cli, llama-quantize, llama-server all built, exit 0, 0 errors.

Decode smoke (Metal, GPU offload)

gemma-4-12b-it-Q8_0, -ngl 99 -c 512 -st:

Prompt: 141.1 t/s | Generation: 33.8 t/s

Coherent output, clean exit. Metal backend confirmed working with the merged EAGLE3 code.

Vulkan and HIP/ROCm results to follow from the RDNA4 box.

TheTom · 2026-06-16T23:58:00Z

Verification: AMD RDNA4 (RX 9070 XT, gfx1201, ROCm 7.1) — Vulkan + HIP

Built from the same PR head (d75ef87ae) on the RDNA4 box. Configured with -DLLAMA_BUILD_SERVER=ON -DLLAMA_CURL=OFF. Note: the unified CLI target is now llama-app (binary llama.exe), and it unconditionally links the server + cli impl libs, so -DLLAMA_BUILD_SERVER=OFF breaks the llama.exe link (llama-server-impl.lib missing). Build with server ON.

HIP / ROCm: green

Full build clean: llama.exe, llama-quantize.exe, test-backend-ops.exe, all impl libs and ggml-hip.lib linked, HIP_BUILD_RC=0.

Device detected and inference works on GPU:

found 1 ROCm devices: AMD Radeon RX 9070 XT, gfx1201 (0x1201), Wave Size 32, VRAM 16304 MiB
llama.exe cli -m qwen2.5-1.5b-instruct-q8_0.gguf -ngl 99 -st
> The capital of France is Paris.
[ Prompt: 934.6 t/s | Generation: 173.8 t/s ]

One finding from test-backend-ops: it passes the standard op suite, then hard-aborts at GET_ROWS for the TurboQuant tq3_1s type:
```
ggml/src/ggml-cuda/getrows.cu:226: ggml_cuda_get_rows_switch_src0_type: unsupported src0 type: tq3_1s
(abort, exit 0xC0000409)
```
The HIP get_rows src0-type switch has no tq3_1s case, so instead of reporting the op as unsupported it calls GGML_ABORT. Likely the catch-up merge did not re-add the TQ3_1S case to upstream's restructured getrows.cu. Normal-model inference is unaffected (the smoke above is clean); this only bites the TQ3_1S get_rows path and crashes test-backend-ops before it can finish. Worth a follow-up: add the tq3_1s (and any sibling TQ types') case to ggml_cuda_get_rows_switch_src0_type, or fall through to unsupported instead of aborting.

Vulkan: blocked at shader-gen (the deferred turbo3 FA item)

CMake configures fine: Vulkan backend detected, glslc reports coopmat, GL_EXT_bfloat16, and integer-dot support.
The build then stalls indefinitely: glslc.exe hangs compiling vulkan-shaders/flash_attn.comp with -DDATA_A_TURBO3_0=1 (the turbo3 flash-attention variant). Confirmed multiple glslc processes pinned on that exact shader+define for 7+ minutes with zero progress, so the embedded-shader generation never completes and no backend objects build.
This matches the PR's own deferred TODO ("re-port turbo3 Vulkan FA, upstream's FA stack evolved past the fork's"). The turbo3 Vulkan FA shader needs reconciling against upstream's current flash_attn.comp before a Vulkan build can complete on this box. Not a surprise regression, but it does block the Vulkan target entirely as-is.

Summary

| Backend | Build | Runtime |
|---|---|---|
| Metal (M5 Max) | green | decode OK, 141 / 33.8 t/s |
| HIP (gfx1201, ROCm 7.1) | green | decode OK, 934.6 / 173.8 t/s; `test-backend-ops` aborts on `tq3_1s` GET_ROWS |
| Vulkan (RDNA4) | blocked | shader-gen hangs on turbo3 `flash_attn.comp` (deferred re-port) |

sztlink · 2026-06-17T00:01:33Z

CUDA results (RTX 4090, sm_89, CUDA 13.0, WSL2). Built green off tom/catchup-from-feature: cmake -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=89, llama-cli + llama-completion + ggml-cuda link clean.

Turbo KV A/B, Llama-3.1-8B-Instruct-Q4_K_M, -ngl 99, temp 0, single-turn, same prompt:

| KV | coherence | decode tok/s | prefill tok/s |
|---|---|---|---|
| f16 | baseline | 151.9 | 1298 |
| turbo4 | tracks f16 | 146.7 | 1769 |
| turbo3 | tracks f16 | 146.3 | 1779 |
| turbo2 | coherent | 150.4 | 1874 |

All four give consistent, coherent output; turbo4/3/2 track f16 (same topic and quality, no degeneration). Decode within ~3-4% of f16 (turbo2 ~= f16); prefill is actually faster on turbo. Attn-rotation left default-off. This is a short-context smoke (37-tok prompt / 160 gen), so correctness + no-regression rather than the long-context memory win. Happy to run long-context or other archs if useful. CUDA box checked.
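The relative figures quoted above can be rechecked from the table with a few lines of arithmetic. The sketch below is only a convenience for readers: the throughput numbers are copied from the table, while the function names (`relative_change`, `deltas_vs_f16`) are made up for this illustration.

```python
# Throughput figures copied from the table above (tokens per second).
RESULTS = {
    "f16":    {"decode": 151.9, "prefill": 1298.0},
    "turbo4": {"decode": 146.7, "prefill": 1769.0},
    "turbo3": {"decode": 146.3, "prefill": 1779.0},
    "turbo2": {"decode": 150.4, "prefill": 1874.0},
}


def relative_change(value, baseline):
    """Return the percentage change of value relative to baseline."""
    return (value - baseline) / baseline * 100.0


def deltas_vs_f16(results):
    """Compute decode/prefill change of every KV type against the f16 row."""
    base = results["f16"]
    return {
        kv: {m: relative_change(row[m], base[m]) for m in ("decode", "prefill")}
        for kv, row in results.items()
        if kv != "f16"
    }


if __name__ == "__main__":
    for kv, d in deltas_vs_f16(RESULTS).items():
        print(f"{kv}: decode {d['decode']:+.1f}%, prefill {d['prefill']:+.1f}%")
```

On these numbers turbo4 and turbo3 decode land roughly 3.4% and 3.7% below f16, turbo2 about 1% below, and prefill is roughly 36% to 44% above f16, which is consistent with the summary in the comment.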

apollo-mg · 2026-06-17T01:27:10Z

Merged into a fresh compile on CUDA (4x NVIDIA P100s).

Qwen 3.6 35B MoE MTP Heretic Q5. KV Q8/Turbo4

Previous (PR 172 + Row Splitting):

Prompt Eval: 97.56 t/s
Decode Speed: 27.35 t/s

New (PR 182 + Layer Splitting):

Prompt Eval: 261.47 t/s (+168%)
Decode Speed: 56.86 t/s (+108%)
Total Time: Dropped from 236 seconds to 88 seconds.

Will test ROCm on RDNA4 next.

ggml_backend_cuda_device_supports_op reported GGML_OP_GET_ROWS as supported for TQ4_1S and TQ3_1S, but the get_rows src0-type switch had no cases for them, so any get_rows on a TurboQuant weight hit the default GGML_ABORT (test-backend-ops crashed at the tq3_1s case with 0xC0000409 on RDNA4/ROCm 7.1). Add the two cases using the existing per-pair dequant kernels (dequantize_tq4_1s / dequantize_tq3_1s, QR=1 -> 2 consecutive elements), mirroring how convert.cu wires the same dequant functions. Validated on RX 9070 XT (gfx1201, ROCm 7.1): test-backend-ops -o GET_ROWS now passes all tq3_1s and tq4_1s cases against the CPU reference (Backend ROCm0: OK).

TheTom · 2026-06-17T01:34:50Z

Fix pushed (`ed81ed03e`): CUDA/HIP get_rows for TQ4_1S and TQ3_1S

Root cause of the test-backend-ops abort: ggml_backend_cuda_device_supports_op lists GGML_OP_GET_ROWS as supported for both TQ4_1S and TQ3_1S (ggml-cuda.cu ~5500), but the get_rows src0-type switch in getrows.cu had no cases for them, so any get_rows on a TurboQuant weight fell through to default: GGML_ABORT. test-backend-ops trusts supports_op, invokes the op, and hard-crashes (0xC0000409). Not a merge regression: the pre-merge feature branch had no TQ cases either, so this path was never implemented.

Fix adds the two cases using the existing per-pair dequant kernels (dequantize_tq4_1s / dequantize_tq3_1s, QR=1), mirroring how convert.cu wires the same functions.
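The class of bug described here (a capability query that advertises a type while the dispatch switch has no case for it) can be modelled in a few lines. This is a hypothetical Python model, not the ggml code: `ADVERTISED`, `HANDLERS_BEFORE`, `HANDLERS_AFTER`, `copy_row`, `missing_cases`, and `get_rows` are names invented for the illustration, and the handler is a stand-in for the real dequant kernels.

```python
# Hypothetical model of the bug; names are illustrative, not the real ggml symbols.

# Types the backend advertises as supported for GET_ROWS.
ADVERTISED = {"f16", "f32", "q8_0", "tq4_1s", "tq3_1s"}


def copy_row(row):
    """Stand-in for a per-type dequantization kernel."""
    return list(row)


# Dispatch table before the fix: no TurboQuant entries.
HANDLERS_BEFORE = {"f16": copy_row, "f32": copy_row, "q8_0": copy_row}
# Dispatch table after the fix: the two missing cases are added.
HANDLERS_AFTER = {**HANDLERS_BEFORE, "tq4_1s": copy_row, "tq3_1s": copy_row}


def missing_cases(advertised, handlers):
    """Return advertised types that have no dispatch case, sorted by name."""
    return sorted(advertised - set(handlers))


def get_rows(src_type, rows, indices, handlers):
    """Gather rows by index, raising a catchable error for unhandled types."""
    handler = handlers.get(src_type)
    if handler is None:
        raise NotImplementedError(f"unsupported src0 type: {src_type}")
    return [handler(rows[i]) for i in indices]


if __name__ == "__main__":
    print(missing_cases(ADVERTISED, HANDLERS_BEFORE))
    print(missing_cases(ADVERTISED, HANDLERS_AFTER))
```

A consistency check of this shape, run over the advertised set and the dispatch table, would flag the gap before a test harness trusts the capability query and invokes the op.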

Validated on RX 9070 XT (gfx1201, ROCm 7.1):

test-backend-ops -o GET_ROWS
GET_ROWS(type=tq3_1s,...): OK   (x4)
GET_ROWS(type=tq4_1s,...): OK   (x4)
Backend ROCm0: OK

Vulkan turbo3 FA shader-gen hang is separate and still open (the deferred turbo3 FA re-port); investigating a build-unblock for it next.

sztlink · 2026-06-17T02:36:06Z

Confirmed ed81ed03e on the CUDA path (RTX 4090, sm_89, CUDA 13.0, WSL2), the platform untested above.

Before (d75ef87), test-backend-ops -o GET_ROWS aborts (SIGABRT, exit 134):

getrows.cu:226: ggml_cuda_get_rows_switch_src0_type: unsupported src0 type: tq3_1s

After (ed81ed03e, incremental rebuild, getrows.cu only):

GET_ROWS(type=tq3_1s, ...): OK   x4
GET_ROWS(type=tq4_1s, ...): OK   x4
55/55 tests passed
Backend CUDA0: OK

The same reproduce-then-fix result you saw on RDNA4/HIP holds on CUDA.

The upstream catch-up left the Vulkan flash-attention shaders unbuildable: flash_attn_dequant.glsl and flash_attn_cm1.comp had orphan #endif directives (the merge swapped the fork's turbo FA dequant for upstream's bf16 path but left dangling #endif from the old #if defined(DATA_A_TURBO3_0) blocks). glslc then failed/hung on every FA shader permutation, leaving empty generated .cpp files and an unresolved ggml-vulkan.dll link, so the whole Vulkan backend could not build.

- Remove the 2 orphan #endif in each of flash_attn_dequant.glsl and flash_attn_cm1.comp to rebalance the preprocessor.
- Stop generating the turbo3 FA SPIR-V variant in vulkan-shaders-gen (its dequant path was dropped by the merge; the variant is not wired into the runtime and glslc hangs on it).
- Report TURBO2/3/4 K/V as unsupported for FLASH_ATTN_EXT in the Vulkan supports_op so they are skipped rather than dispatched to a missing pipeline (was producing inf -> test-backend-ops FAIL).

Validated on RX 9070 XT (gfx1201): Vulkan build completes, GPU detected, llama-cli decode coherent (150/142 t/s), and test-backend-ops -o FLASH_ATTN_EXT has 0 turbo failures (now skipped).

Known remaining: q8_0 K/V FA fails on this build (separate, pre-existing/merge FA-quant issue, under investigation).

TheTom · 2026-06-17T02:41:45Z

Fix pushed (`7d5ac40c2`): Vulkan FA shaders build again + turbo FA gated

Root cause of the earlier Vulkan "glslc hang / build blocked": the catch-up merge left the flash-attention shaders unbuildable. flash_attn_dequant.glsl and flash_attn_cm1.comp had orphan #endif directives: the merge swapped the fork's turbo FA dequant for upstream's bf16 path but left dangling #endif lines from the old #if defined(DATA_A_TURBO3_0) blocks (flash_attn_dequant.glsl: 1 #if / 3 #endif; flash_attn_cm1.comp: 9 #if / 11 #endif). glslc then errored or spun on every FA permutation, producing 0-byte generated .cpp files and an unresolved ggml-vulkan.dll link, so the whole Vulkan backend could not build.

Changes:

Rebalanced the preprocessor: removed the 2 orphan #endif in each of flash_attn_dequant.glsl and flash_attn_cm1.comp.
Stopped generating the turbo3 FA SPIR-V variant in vulkan-shaders-gen.cpp (its dequant path was dropped by the merge, it is not wired into the runtime, and glslc hangs on it).
supports_op: report TURBO2/3/4 K/V as unsupported for FLASH_ATTN_EXT so they are skipped instead of dispatched to a missing pipeline (was producing inf -> FAIL).
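An #if / #endif imbalance of the kind described in the root cause can be caught with a simple pre-build lint. The sketch below is an assumption-laden illustration, not part of this PR: `conditional_balance`, `orphan_endifs`, and the `SAMPLE` text are hypothetical, the sample is not the real shader source, and the matcher ignores directives hidden inside comments or string literals.

```python
import re

# Opening conditionals: #if, #ifdef, #ifndef (optional spaces around '#').
_OPEN = re.compile(r"^\s*#\s*(if|ifdef|ifndef)\b")
_CLOSE = re.compile(r"^\s*#\s*endif\b")


def conditional_balance(source):
    """Return (number of opening conditionals, number of #endif lines)."""
    opens = closes = 0
    for line in source.splitlines():
        if _OPEN.match(line):
            opens += 1
        elif _CLOSE.match(line):
            closes += 1
    return opens, closes


def orphan_endifs(source):
    """Return 1-based line numbers of #endif lines with no open conditional."""
    depth = 0
    orphans = []
    for number, line in enumerate(source.splitlines(), start=1):
        if _OPEN.match(line):
            depth += 1
        elif _CLOSE.match(line):
            if depth == 0:
                orphans.append(number)
            else:
                depth -= 1
    return orphans


# Made-up sample with the same 1 #if / 3 #endif shape reported for one file.
SAMPLE = """\
#if defined(EXAMPLE_FLAG)
float example(uint i) { return 0.0; }
#endif
#endif
#endif
"""

if __name__ == "__main__":
    print(conditional_balance(SAMPLE))
    print(orphan_endifs(SAMPLE))
```

Running a check like this over the shader directory after a large merge would report the orphan lines directly, rather than leaving the compiler to fail or spin on each permutation.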

Validated on RX 9070 XT (gfx1201, AMD proprietary driver, KHR_coopmat):

Vulkan build completes (was fully broken), llama.exe links.
GPU detected, llama.exe cli decode coherent: The capital of France is Paris. [ Prompt 150.0 t/s | Generation 142.1 t/s ].
test-backend-ops -o FLASH_ATTN_EXT: turbo FA failures 183 -> 0 (now correctly skipped); thousands of f16/f32/q4/q5 FA cases pass.

Known remaining (not fixed here): q8_0 K/V flash attention

test-backend-ops still shows ~17 FLASH_ATTN_EXT(type_K=q8_0,type_V=q8_0) failures (ERR=inf), including basic cases. The pre-catch-up fork's Vulkan build passed q8_0 FA, so this looks like a regression in the merged FA quantized-KV path, but I could not get a same-version clean-upstream Vulkan reference on this box to confirm whether it is merge-introduced or an upstream/AMD-driver limitation surfaced by newer test cases. Normal f16/f32 KV inference is unaffected. Flagging for follow-up.

q8_0 (and other quantized) K/V flash attention produced inf / -FLT_MAX on Vulkan with the default (f16-accumulation) precision: dequantized KV magnitudes overflow f16 accumulation at head size 128, corrupting the softmax. test-backend-ops -o FLASH_ATTN_EXT showed 22 FAIL for type_K=q8_0,type_V=q8_0 at prec=def while every prec=f32 case passed, isolating it to the f16 accumulation path. This matches the known Vulkan quantized-KV-cache + FA corruption reported upstream (incoherent output with --cache-type-k/v q8_0 on AMD).

Treat quantized KV like BF16 in the f32acc decision in ggml_vk_flash_attn so the f32-accumulation shader is selected whenever K or V is quantized.

Validated on RX 9070 XT (gfx1201, ROCm/Vulkan KHR_coopmat):

- test-backend-ops -o FLASH_ATTN_EXT: 0 real failures (was 22 q8_0).
- llama-cli decode with -fa on -ctk q8_0 -ctv q8_0: coherent output (529/103 t/s), no inf/garbage.

TheTom · 2026-06-17T03:16:19Z

Fix pushed (`f94c3b840`): q8_0 K/V flash attention on Vulkan — root-caused and resolved

Follow-up to the q8_0 K/V FA failure flagged earlier. Investigated with deep research + on-device bisection on the RX 9070 XT.

Correction to the earlier note: this is not a catch-up-merge regression. The old pre-catch-up fork's Vulkan test-backend-ops never exercised hsk=128 q8_0 FA (it only tested hsk=64), so its clean q8_0 result was not evidence the merge broke anything. The flash_attn_dequant.glsl q8_0 dequant path is byte-identical to upstream.

Root cause (isolated on-device): at hsk=128, all q8_0 K/V FA failures are prec=def (f16 accumulation); every prec=f32 case passes:

hsk=128 q8_0:  prec=f32 -> 153 OK / 0 FAIL
               prec=def -> 130 OK / 22 FAIL  (inf / -FLT_MAX)

Dequantized q8_0 KV magnitudes overflow f16 accumulation in the flash-attention softmax at head size 128, corrupting the result. This is the same mechanism behind the known upstream Vulkan corruption with --cache-type-k/v q8_0 (e.g. ggml-org#17995), inherited via the upstream FA shaders, and it affects upstream too.

Fix: in ggml_vk_flash_attn, force f32 accumulation whenever K or V is quantized (treat quantized KV like BF16 in the f32acc decision). One line.
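To make the overflow mechanism concrete, here is a minimal sketch that accumulates a dot product in emulated f16 and in f32. It is not the shader's arithmetic: the magnitudes (24.0 per element) are chosen only so that the effect is visible at head size 128, and `to_f16`, `to_f32`, and `dot` are hypothetical helpers built on Python's `struct` half-precision format.

```python
import math
import struct


def to_f16(x):
    """Round x to IEEE 754 binary16; values too large become +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def to_f32(x):
    """Round x to IEEE 754 binary32."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dot(q, k, rnd):
    """Dot product whose products and running sum are rounded by rnd."""
    acc = 0.0
    for a, b in zip(q, k):
        acc = rnd(acc + rnd(a * b))
    return acc


if __name__ == "__main__":
    for head_size in (64, 128):
        q = [24.0] * head_size  # illustrative magnitudes, not measured data
        k = [24.0] * head_size
        print(head_size, dot(q, k, to_f16), dot(q, k, to_f32))
```

With these illustrative inputs the f16 running sum stays finite at 64 elements but passes the largest finite half-precision value (65504) partway through 128 elements and becomes inf, while the f32 sum stays exact. An inf reaching the softmax is enough to corrupt the output, which is why selecting the f32-accumulation path for quantized K/V removes the failures.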

Validated on RX 9070 XT (gfx1201, KHR_coopmat):

test-backend-ops -o FLASH_ATTN_EXT: 0 real failures (was 22 q8_0; turbo gated separately).

End-to-end decode with -fa on -ctk q8_0 -ctv q8_0: coherent output, no inf/garbage:

> List three facts about the moon.
1. The Moon is Earth's only natural satellite...
2. The Moon has no atmosphere...
[ Prompt 529.5 t/s | Generation 103.5 t/s ]

This is worth upstreaming separately, since plain upstream has the same f16-accumulation overflow for quantized KV flash attention.

TheTom · 2026-06-17T03:42:01Z

For posterity: Vulkan `MUL_MAT` crash on AMD is a driver bug, not a code issue

While validating the Vulkan backend, test-backend-ops -o MUL_MAT on the RX 9070 XT (gfx1201, AMD proprietary driver, KHR_coopmat) hard-crashes with an access violation (0xC0000005) at:

MUL_MAT(type_a=f16,type_b=f32,m=16,n=4,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3])

This is a process crash, not a correctness FAIL. Investigated and concluded it is an AMD proprietary Vulkan driver bug, not a llama.cpp / this-fork code issue:

Pre-existing: the old pre-catch-up fork build crashes at the identical config, so it is not introduced by the upstream catch-up or any fix in this PR.
Not a specific shader path: crashes with GGML_VK_DISABLE_COOPMAT=1 and with GGML_VK_DISABLE_F16=1 (and with both), so it is not the coopmat, f16, or any one matmul shader.
Matches an unresolved upstream report: ggml-org/llama.cpp#17432 ("Eval bug: AMD driver 25.11.1 crashes with llamacpp built with Vulkan SDK 1.4.328.1"; AMD proprietary driver + KHR_coopmat crash), where the core Vulkan maintainers could not fix it in code and GGML_VK_DISABLE_COOPMAT=1 did not help either. RADV (Mesa) passes the same MUL_MAT tests (ggml-org/llama.cpp#24521, "Misc. bug: test-backend-ops crashes on some tests"), confirming it is driver-specific.
No impact on real inference: llama-cli decode (including -fa on -ctk q8_0 -ctv q8_0) produces coherent output; only the synthetic test shape trips the driver.

Box driver at time of testing: AMD Adrenalin 32.0.23033.1002 (2026-03-08). The only known remedy (per upstream maintainers) is to try a newer/different AMD driver. No repo change applies; recording here so it is not re-investigated as a code regression.

The catch-up added GGML_OP_TURBO_WHT (GGML_OP_COUNT 97 -> 98) but left the static_assert(GGML_OP_COUNT == 97) in ggml-rpc.h, breaking every RPC-enabled build (arm64, ubuntu-rpc, macos, windows, ...) with a header static-assertion failure. Update the assert to 98 and bump RPC_PROTO_PATCH_VERSION (the op enum is part of the RPC wire protocol). Verified: ggml-rpc compiles with -DGGML_RPC=ON on M5 (Metal).

fl0rianr and others added 30 commits May 28, 2026 15:01

ggml: auto apply iGPU flag CUDA/HIP if integrated device (ggml-org#23007)

30af6e2

test-llama-archs: fix table format [no release] (ggml-org#23810)

d374e71

arg: Add LLAMA_ARG_API_KEY_FILE environment variable for --api-key-file (ggml-org#23167)

7fb1e70

ci : change Vulkan builds to Release to reduce ccache (ggml-org#23820)

dd15579

* ci : disable all CPU variant builds for Vulkan workflow * cont : change cache key * cont : change build type

mtmd: fix gemma 4 audio rms norm eps (ggml-org#23815)

d6be315

* mtmd: fix gemma 4 audio rms norm eps * Update tools/mtmd/clip.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

mtmd: n_head_kv defaults to n_head (ggml-org#23782)

0b56d28

removed AI-generated comment

app : improve help output (ggml-org#23805)

479a9a1

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

ci : releases use Github-hosted builds for the UI (ggml-org#23823)

445b7ce

* ci : releases use Github-hosted builds for the UI * cont : fix name

ui: fix audio and video modality detection (ggml-org#23756)

2f6c815

When model props are fetched asynchronously from the server, modelPropsVersion is incremented to trigger reactivity, but only the vision effect was listening to it.

ci : run ui publish on ubuntu-slim (ggml-org#23818)

3ef2369

* run ui publish on self-hosted fast * run on ubuntu-slim

opencl: move backend info printing into its own function (ggml-org#23702)

408ae2b

* opencl: move backend info print into its own function * opencl: move new log line * opencl: fix for non adreno path

mtmd: fix gemma 4 projector pre_norm (ggml-org#23822)

c8914ad

mtmd-debug: add color and rainbow mode (ggml-org#23829)

751ebd1

* mtmd-debug: add color and rainbow mode * fix M_PI * max_dist

hexagon: basic/generic op fusion support and RMS_NORM+MUL fusion (ggml-org#23835)

19e92c3

Updating infra to enable op fusion and using RMS_NORM+MUL as the use-case.

meta : Add missing buffer set in allreduce fallback !COMPUTE clear (ggml-org#23480)

33c718d

Without this at least the vulkan backend will skip the `* 0` for !COMPUTE tensors, causing corrupt output.

cuda : disables launch_fattn PDL enrollment due to compiler bug (ggml-org#23825)

241cbd4

app : move licences to llama-app (ggml-org#23824)

98e480a

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

llama: add llm_graph_input_mtp (ggml-org#23643)

eef59a7

* llama: add llm_graph_input_mtp * rename input_mtp -> input_token_embd * add TODO about mtmd embedding * cont : clean-up --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

ngram-mod : Add missing include (ggml-org#23857)

b000431

[no release] Signed-off-by: Omid Azizi <oazizi@gimletlabs.ai>

ggml : bump version to 0.13.1 (ggml/1523)

ea02bc3

sync : ggml

fe12e42

llama: use f16 mask for FA to save VRAM (ggml-org#23764)

031ddb2

* llama: use f16 mask for FA * review: add llama_cast + formatting * simplify

server: bump timeout to 3600s (ggml-org#23842)

cb47092

* server: bump timeout to 3600s * nits: change wording

download: add option to skip_download (ggml-org#23059)

06d26df

* download: add option to skip_download * fix * fix 2 * if file doesn't exist, respect skip_download flag

ci : update macos release to use macos-26 runner (ggml-org#23878)

dc71236

server: remove obsolete scripts (ggml-org#23870)

b5f5228

graph : ensure DS32 kq_mask_lid is F32 (ggml-org#23864)

764f1e6

github-actions Bot added testing devops python script model OpenCL SYCL build nix jinja parser Hexagon WebGPU AMD ZenDNN android server/ui CUDA labels Jun 14, 2026

TheTom merged commit 35ac80d into feature/turboquant-kv-cache Jun 17, 2026
41 of 70 checks passed

TheTom merged 447 commits into feature/turboquant-kv-cache from tom/catchup-from-feature


20 participants
