
Catch up to upstream/master (preserve all TurboQuant+ / MTP / contributor work)#182

Merged
TheTom merged 447 commits into feature/turboquant-kv-cache from tom/catchup-from-feature
Jun 17, 2026

@TheTom

TheTom commented Jun 14, 2026

Owner

Full catch-up merge of upstream/master into feature/turboquant-kv-cache. Merge (not rebase), so all fork commits, hashes, and authors are retained.

Verified on M5 Max (Metal)

  • Build green (full build, llama-cli + llama-quantize).
  • Turbo KV A/B vs f16 (gemma-4-12B-it-Q8_0): turbo4/turbo3 track f16, turbo2 coherent. Attn-rotation default-off policy holds.

Preserved (verified)

  • CUDA TurboQuant (@signalnine / Gabe Ortiz): turbo2/3/4, TQ4_1S/TQ3_1S, WHT, sparse-V, InnerQ, MLA, D=640 MMA FA
  • HIP: variadic shfl, VEC-force, #176 graph-capture decode fix ("HIP: fix turbo KV decode crash under graph capture; batch-aware VEC/TILE FA routing")
  • Metal: TQ3-rotated mul_mm + turbo_wht
  • KV cache: auto-asymmetric turbo-K upgrade, default-off per-side attn-rotation, turbo head padding, n_layer_kv, +3 overhead
  • MTP: gemma4-assistant masked-embd, draft-MTP server multimodal, ctx_other
  • Server: get_slot_by_cache_key (unioned with upstream get_slot_by_cmpl_id)
  • Vulkan turbo3 KV-cache pipelines

Key resolutions

See TURBOQUANT_UPSTREAM_MERGE.md. Highlights: kv-cache ctor union (upstream shared-cells + fork auto-asymmetric + n_layer_kv); attn-rotation keeps fork policy + grafts upstream's DeepSeek-V3.2 indexer force; fattn-common.cuh takes upstream's graph-safe f16_extra (supersedes HIP pool workaround).

Build fixes (auto-merge dual-additions / API drift)

GGML_OP_COUNT 97->98 (TURBO_WHT); duplicate n_outputs_max; missing n_embd decl; duplicate get_suppress_tokens / PROJECTOR_TYPE_GEMMA4UA / clip_graph_gemma4uv; server get_available_slot reconciled.

TODO before relying on this (per-backend, untestable on M5)

  • AMD RDNA4: build Vulkan, reconcile shader-gen, re-port turbo3 Vulkan FA (deferred — upstream's FA stack evolved past the fork's; recovery commits in the doc), smoke-test turbo KV
  • 5090 / CUDA: build CUDA, run turbo KV correctness + perf tests
  • M3 GGUF Config-I: with this catch-up + upstream MiniMax-M3 support (ggml-org/llama.cpp#24523, "Preliminary MiniMax-M3 support"), quantize M3 to Config-I in GGUF

⚠️ Vulkan turbo3 flash-attention fast-path is deferred, not lost — taken to upstream's FA, re-port tracked for the AMD box. Turbo3 Vulkan KV cache itself is preserved.

fl0rianr and others added 30 commits May 28, 2026 15:01
* ci : disable all CPU variant builds for Vulkan workflow

* cont : change cache key

* cont : change build type
* mtmd: fix gemma 4 audio rms norm eps

* Update tools/mtmd/clip.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* ci : releases use Github-hosted builds for the UI

* cont : fix name
When model props are fetched asynchronously from the server,
modelPropsVersion is incremented to trigger reactivity, but
only the vision effect was listening to it.
* run ui publish on self-hosted fast

* run on ubuntu-slim
)

* opencl: move backend info print into its own function

* opencl: move new log line

* opencl: fix for non adreno path
* mtmd-debug: add color and rainbow mode

* fix M_PI

* max_dist
…l-org#23835)

Updating infra to enable op fusion and using RMS_NORM+MUL as the use-case.
…gml-org#23480)

Without this at least the vulkan backend will skip the `* 0` for
!COMPUTE tensors, causing corrupt output.
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* llama: add llm_graph_input_mtp

* rename input_mtp -> input_token_embd

* add TODO about mtmd embedding

* cont : clean-up

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
[no release]

Signed-off-by: Omid Azizi <oazizi@gimletlabs.ai>
* llama: use f16 mask for FA

* review: add llama_cast + formatting

* simplify
…se Attention (DSA) implementation (ggml-org#23346)

* llama : support DeepSeek V3.2 model family (with DSA lightning indexer)

* convert : handle DeepseekV32ForCausalLM architecture

* ggml : support for f16 GGML_OP_FILL

* memory : separate hparams argument in llama_kv_cache constructor

* memory : add llama_kv_cache_dsa memory (KV cache + lightning indexer cache)

* llama : support for LLM_ARCH_DEEPSEEK32

* model : llama_model_deepseek32 implementation

* model : merge two scale operations into one in DSA lightning indexer implementation

* chore : remove unused code

* model : support NVFP4 in DeepSeek V3.2

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* memory : refactoring TODO

Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>
* server: bump timeout to 3600s

* nits: change wording
…23530)

* CUDA: Check PTX version on host side to guard PDL dispatch

Checking on `__CUDA_ARCH_LIST__` alone is insufficient for JIT, as this
variable doesn't differentiate between compiling for say sm_90, sm_90a
or sm_90f (so forward-jittable PTX vs. arch/family-specific PTX).

Thus, one can have a bug when compiling with
`DCMAKE_CUDA_ARCHITECTURES="89;90a"`, where current code would wrongly
dispatch to PDL on sm_90/sm_120 in forward-JIT mode.

This PR fixes this issue by checking `cudaFuncAttributes::ptxVersion` of
the incoming kernel at runtime. A check on ptxVersion alone is
sufficient, as device-codes will always be >= ptxVersion (and any
violation of this would be a severe bug in CUDA/nvcc), see:
 https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/#gpu-code-code-code

* Implement MurmurHash3 mixer for better hash distribution

Magic constants were taken from boost:
https://github.com/boostorg/container_hash/blob/2698b43803c012601e6bb1a6116e83767b97986c/include/boost/container_hash/detail/hash_mix.hpp#L19-L65

* Update ggml/src/ggml-cuda/common.cuh

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Address review comments, make seed non-zero

* Apply code-formatting

* Replace std::size_t -> size_t for consistency

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* mtmd: DeepSeek-OCR 2 support, with multi-tile dynamic resolution

* introduced clip_image_f32::add_viewsep

* address PR review

- drop redundant ggml_cpy ops in both deepseekocr versions build
- drop no-op ggml_cont in build_sam
- assert num_image_tokens deepseekocr2
- view_seperator as (1, n_embd) at conversion (for both versions)
- drop redundant ggml_reshape_2d

* Update tools/mtmd/models/deepseekocr2.cpp

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
* download: add option to skip_download

* fix

* fix 2

* if file doesn't exist, respect skip_download flag
@TheTom

TheTom commented Jun 16, 2026

Owner Author

Verification: M5 Max (Metal)

EAGLE3 (ggml-org#18039) status

No cherry-pick required. The upstream EAGLE3 squash merge (88a39274e) landed before this branch's catch-up merge (d75ef87ae), so the feature is already present in this PR. Confirmed intact, not just in history:

  • src/models/eagle3.cpp and common/speculative.cpp are byte-identical to upstream's merged version (0-line diff against 88a39274e).
  • eagle3 arch registered in src/llama-arch.cpp (LLM_ARCH_EAGLE3).
  • The remaining diffs in llama-context.cpp / llama-graph.cpp vs the upstream merge are TurboQuant + later-upstream changes, not EAGLE3 regressions.

Build

Full build green, AppleClang 21, build b9638-d75ef87ae:

  • llama-cli, llama-quantize, llama-server all built, exit 0, 0 errors.

Decode smoke (Metal, GPU offload)

gemma-4-12b-it-Q8_0, -ngl 99 -c 512 -st:

Prompt: 141.1 t/s | Generation: 33.8 t/s

Coherent output, clean exit. Metal backend confirmed working with the merged EAGLE3 code.

Vulkan and HIP/ROCm results to follow from the RDNA4 box.

@TheTom

TheTom commented Jun 16, 2026

Owner Author

Verification: AMD RDNA4 (RX 9070 XT, gfx1201, ROCm 7.1) — Vulkan + HIP

Built from the same PR head (d75ef87ae) on the RDNA4 box. Configured with -DLLAMA_BUILD_SERVER=ON -DLLAMA_CURL=OFF. Note: the unified CLI target is now llama-app (binary llama.exe), and it unconditionally links the server + cli impl libs, so -DLLAMA_BUILD_SERVER=OFF breaks the llama.exe link (llama-server-impl.lib missing). Build with server ON.

HIP / ROCm: green

  • Full build clean: llama.exe, llama-quantize.exe, test-backend-ops.exe, all impl libs and ggml-hip.lib linked, HIP_BUILD_RC=0.
  • Device detected and inference works on GPU:
    found 1 ROCm devices: AMD Radeon RX 9070 XT, gfx1201 (0x1201), Wave Size 32, VRAM 16304 MiB
    llama.exe cli -m qwen2.5-1.5b-instruct-q8_0.gguf -ngl 99 -st
    > The capital of France is Paris.
    [ Prompt: 934.6 t/s | Generation: 173.8 t/s ]
    
  • One finding from test-backend-ops: it passes the standard op suite, then hard-aborts at GET_ROWS for the TurboQuant tq3_1s type:
    ggml/src/ggml-cuda/getrows.cu:226: ggml_cuda_get_rows_switch_src0_type: unsupported src0 type: tq3_1s
    (abort, exit 0xC0000409)
    
    The HIP get_rows src0-type switch has no tq3_1s case, so instead of reporting the op as unsupported it calls GGML_ABORT. Likely the catch-up merge did not re-add the TQ3_1S case to upstream's restructured getrows.cu. Normal-model inference is unaffected (the smoke above is clean); this only bites the TQ3_1S get_rows path and crashes test-backend-ops before it can finish. Worth a follow-up: add the tq3_1s (and any sibling TQ types') case to ggml_cuda_get_rows_switch_src0_type, or fall through to unsupported instead of aborting.

Vulkan: blocked at shader-gen (the deferred turbo3 FA item)

  • CMake configures fine: Vulkan backend detected, glslc reports coopmat, GL_EXT_bfloat16, and integer-dot support.
  • The build then stalls indefinitely: glslc.exe hangs compiling vulkan-shaders/flash_attn.comp with -DDATA_A_TURBO3_0=1 (the turbo3 flash-attention variant). Confirmed multiple glslc processes pinned on that exact shader+define for 7+ minutes with zero progress, so the embedded-shader generation never completes and no backend objects build.
  • This matches the PR's own deferred TODO ("re-port turbo3 Vulkan FA, upstream's FA stack evolved past the fork's"). The turbo3 Vulkan FA shader needs reconciling against upstream's current flash_attn.comp before a Vulkan build can complete on this box. Not a surprise regression, but it does block the Vulkan target entirely as-is.

Summary

| Backend | Build | Runtime |
| --- | --- | --- |
| Metal (M5 Max) | green | decode OK, 141 / 33.8 t/s |
| HIP (gfx1201, ROCm 7.1) | green | decode OK, 934.6 / 173.8 t/s; test-backend-ops aborts on tq3_1s GET_ROWS |
| Vulkan (RDNA4) | blocked | shader-gen hangs on turbo3 flash_attn.comp (deferred re-port) |

@sztlink

sztlink commented Jun 17, 2026

CUDA results (RTX 4090, sm_89, CUDA 13.0, WSL2). Built green off tom/catchup-from-feature: cmake -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=89, llama-cli + llama-completion + ggml-cuda link clean.

Turbo KV A/B, Llama-3.1-8B-Instruct-Q4_K_M, -ngl 99, temp 0, single-turn, same prompt:

| KV | coherence | decode tok/s | prefill tok/s |
| --- | --- | --- | --- |
| f16 | baseline | 151.9 | 1298 |
| turbo4 | tracks f16 | 146.7 | 1769 |
| turbo3 | tracks f16 | 146.3 | 1779 |
| turbo2 | coherent | 150.4 | 1874 |

All four give consistent, coherent output; turbo4/3/2 track f16 (same topic and quality, no degeneration). Decode within ~3-4% of f16 (turbo2 ~= f16); prefill is actually faster on turbo. Attn-rotation left default-off. This is a short-context smoke (37-tok prompt / 160 gen), so correctness + no-regression rather than the long-context memory win. Happy to run long-context or other archs if useful. CUDA box checked.
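
The relative figures quoted here can be rechecked from the A/B numbers. A minimal Python sketch: the helper names `pct_vs_baseline` and `summarize` are made up for illustration, and the throughput values are copied from this comment.

```python
# Throughput figures copied from the A/B results above (tok/s).
RESULTS = {
    "f16":    {"decode": 151.9, "prefill": 1298.0},
    "turbo4": {"decode": 146.7, "prefill": 1769.0},
    "turbo3": {"decode": 146.3, "prefill": 1779.0},
    "turbo2": {"decode": 150.4, "prefill": 1874.0},
}


def pct_vs_baseline(value: float, baseline: float) -> float:
    """Signed percentage difference of value relative to baseline."""
    return (value / baseline - 1.0) * 100.0


def summarize(results: dict, baseline_key: str = "f16") -> dict:
    """Return {kv_type: {metric: pct_delta}} for every non-baseline row."""
    base = results[baseline_key]
    return {
        kv: {m: round(pct_vs_baseline(row[m], base[m]), 1) for m in row}
        for kv, row in results.items()
        if kv != baseline_key
    }


if __name__ == "__main__":
    for kv, deltas in summarize(RESULTS).items():
        print(f"{kv}: decode {deltas['decode']:+.1f}%, prefill {deltas['prefill']:+.1f}%")
```

With these inputs the decode deltas fall between roughly -1% and -4%, and every turbo prefill delta is positive, which matches the summary in the comment.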

@apollo-mg

apollo-mg commented Jun 17, 2026

Merged into a fresh compile on CUDA (4x NVIDIA P100s).

Qwen 3.6 35B MoE MTP Heretic Q5. KV Q8/Turbo4

Previous (PR 172 + Row Splitting):

  • Prompt Eval: 97.56 t/s
  • Decode Speed: 27.35 t/s

New (PR 182 + Layer Splitting):

  • Prompt Eval: 261.47 t/s (+168%)
  • Decode Speed: 56.86 t/s (+108%)
  • Total Time: Dropped from 236 seconds to 88 seconds.

Will test ROCM on RDNA4 next.

ggml_backend_cuda_device_supports_op reported GGML_OP_GET_ROWS as
supported for TQ4_1S and TQ3_1S, but the get_rows src0-type switch had
no cases for them, so any get_rows on a TurboQuant weight hit the
default GGML_ABORT (test-backend-ops crashed at the tq3_1s case with
0xC0000409 on RDNA4/ROCm 7.1).

Add the two cases using the existing per-pair dequant kernels
(dequantize_tq4_1s / dequantize_tq3_1s, QR=1 -> 2 consecutive
elements), mirroring how convert.cu wires the same dequant functions.

Validated on RX 9070 XT (gfx1201, ROCm 7.1): test-backend-ops -o
GET_ROWS now passes all tq3_1s and tq4_1s cases against the CPU
reference (Backend ROCm0: OK).
@TheTom

TheTom commented Jun 17, 2026

Owner Author

Fix pushed (ed81ed03e): CUDA/HIP get_rows for TQ4_1S and TQ3_1S

Root cause of the test-backend-ops abort: ggml_backend_cuda_device_supports_op lists GGML_OP_GET_ROWS as supported for both TQ4_1S and TQ3_1S (ggml-cuda.cu ~5500), but the get_rows src0-type switch in getrows.cu had no cases for them, so any get_rows on a TurboQuant weight fell through to default: GGML_ABORT. test-backend-ops trusts supports_op, invokes the op, and hard-crashes (0xC0000409). Not a merge regression: the pre-merge feature branch had no TQ cases either, so this path was never implemented.

Fix adds the two cases using the existing per-pair dequant kernels (dequantize_tq4_1s / dequantize_tq3_1s, QR=1), mirroring how convert.cu wires the same functions.
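
The mismatch described above (an op advertised as supported with no matching dispatch case) can be modelled in a few lines. This is a toy Python sketch, not ggml code: the type names come from this thread, while the handler strings, the set and dict names, and both helper functions are hypothetical.

```python
# Toy model of the bug: the "supported" list and the dispatch table drifted apart.
# Type names mirror the ones in this thread; handler names are invented.

ADVERTISED_GET_ROWS = {"f16", "f32", "q8_0", "tq4_1s", "tq3_1s"}

# Stand-in for the src0-type switch before the fix: no TurboQuant weight cases.
DISPATCH_BEFORE = {
    "f16": "get_rows_f16",
    "f32": "get_rows_f32",
    "q8_0": "get_rows_q8_0",
}

# After the fix, both TurboQuant weight types have a handler.
DISPATCH_AFTER = {
    **DISPATCH_BEFORE,
    "tq4_1s": "get_rows_tq4_1s",
    "tq3_1s": "get_rows_tq3_1s",
}


def advertised_but_unhandled(advertised: set, dispatch: dict) -> list:
    """Types that the support query accepts but the dispatcher cannot run."""
    return sorted(advertised - dispatch.keys())


def get_rows(src0_type: str, dispatch: dict) -> str:
    """Mimic the switch: unknown types hit the default branch and fail hard."""
    try:
        return dispatch[src0_type]
    except KeyError:
        raise RuntimeError(f"unsupported src0 type: {src0_type}") from None
```

A consistency check like `advertised_but_unhandled` run over both tables is one way to catch this class of drift before a test harness trusts the support query and invokes the op.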

Validated on RX 9070 XT (gfx1201, ROCm 7.1):

test-backend-ops -o GET_ROWS
GET_ROWS(type=tq3_1s,...): OK   (x4)
GET_ROWS(type=tq4_1s,...): OK   (x4)
Backend ROCm0: OK

Vulkan turbo3 FA shader-gen hang is separate and still open (the deferred turbo3 FA re-port); investigating a build-unblock for it next.

@sztlink

sztlink commented Jun 17, 2026

Confirmed ed81ed03e on the CUDA path (RTX 4090, sm_89, CUDA 13.0, WSL2), the platform untested above.

Before (d75ef87), test-backend-ops -o GET_ROWS aborts (SIGABRT, exit 134):

getrows.cu:226: ggml_cuda_get_rows_switch_src0_type: unsupported src0 type: tq3_1s

After (ed81ed03e, incremental rebuild, getrows.cu only):

GET_ROWS(type=tq3_1s, ...): OK   x4
GET_ROWS(type=tq4_1s, ...): OK   x4
55/55 tests passed
Backend CUDA0: OK

The same reproduce-then-fix sequence you saw on RDNA4/HIP holds on CUDA.

The upstream catch-up left the Vulkan flash-attention shaders unbuildable:
flash_attn_dequant.glsl and flash_attn_cm1.comp had orphan #endif
directives (the merge swapped the fork's turbo FA dequant for upstream's
bf16 path but left dangling #endif from the old
#if defined(DATA_A_TURBO3_0) blocks). glslc then failed/hung on every FA
shader permutation, leaving empty generated .cpp files and an unresolved
ggml-vulkan.dll link, so the whole Vulkan backend could not build.

- Remove the 2 orphan #endif in each of flash_attn_dequant.glsl and
  flash_attn_cm1.comp to rebalance the preprocessor.
- Stop generating the turbo3 FA SPIR-V variant in vulkan-shaders-gen
  (its dequant path was dropped by the merge; the variant is not wired
  into the runtime and glslc hangs on it).
- Report TURBO2/3/4 K/V as unsupported for FLASH_ATTN_EXT in the Vulkan
  supports_op so they are skipped rather than dispatched to a missing
  pipeline (was producing inf -> test-backend-ops FAIL).

Validated on RX 9070 XT (gfx1201): Vulkan build completes, GPU detected,
llama-cli decode coherent (150/142 t/s), and test-backend-ops -o
FLASH_ATTN_EXT has 0 turbo failures (now skipped). Known remaining: q8_0
K/V FA fails on this build (separate, pre-existing/merge FA-quant issue,
under investigation).
@TheTom

TheTom commented Jun 17, 2026

Owner Author

Fix pushed (7d5ac40c2): Vulkan FA shaders build again + turbo FA gated

Root cause of the earlier Vulkan "glslc hang / build blocked": the catch-up merge left the flash-attention shaders unbuildable. flash_attn_dequant.glsl and flash_attn_cm1.comp had orphan #endif directives: the merge swapped the fork's turbo FA dequant for upstream's bf16 path but left dangling #endifs from the old #if defined(DATA_A_TURBO3_0) blocks (flash_attn_dequant.glsl: 1 #if / 3 #endif; flash_attn_cm1.comp: 9 #if / 11 #endif). glslc then errored or spun on every FA permutation, producing 0-byte generated .cpp files and an unresolved ggml-vulkan.dll link, so the whole Vulkan backend could not build.

Changes:

  • Rebalanced the preprocessor: removed the 2 orphan #endif in each of flash_attn_dequant.glsl and flash_attn_cm1.comp.
  • Stopped generating the turbo3 FA SPIR-V variant in vulkan-shaders-gen.cpp (its dequant path was dropped by the merge, it is not wired into the runtime, and glslc hangs on it).
  • supports_op: report TURBO2/3/4 K/V as unsupported for FLASH_ATTN_EXT so they are skipped instead of dispatched to a missing pipeline (was producing inf -> FAIL).
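
An unbalanced preprocessor of the kind fixed here is cheap to detect before handing a shader to glslc. The following is a minimal Python sketch of such a check, assuming a plain line-based scan is enough. It ignores directives inside comments and does not report unterminated `#if` blocks. The function name `orphan_endifs` and the sample fragment are made up for illustration.

```python
import re

# Directives that open or close a conditional block in GLSL / C preprocessing.
_OPEN = re.compile(r"^\s*#\s*(if|ifdef|ifndef)\b")
_CLOSE = re.compile(r"^\s*#\s*endif\b")


def orphan_endifs(source: str) -> list:
    """Return 1-based line numbers of #endif directives with no matching opener."""
    depth = 0
    orphans = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if _OPEN.match(line):
            depth += 1
        elif _CLOSE.match(line):
            if depth == 0:
                orphans.append(lineno)  # nothing left to close
            else:
                depth -= 1
    return orphans


# Made-up fragment with a 1 #if / 3 #endif imbalance, like the one described above.
SAMPLE = """\
#if defined(SOME_FEATURE)
float x = 1.0;
#endif
#endif
#endif
"""
```

Calling `orphan_endifs(SAMPLE)` flags the last two `#endif` lines. Running the same function over each shader file after a large merge would surface this kind of breakage without waiting on a full shader-gen pass.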

Validated on RX 9070 XT (gfx1201, AMD proprietary driver, KHR_coopmat):

  • Vulkan build completes (was fully broken), llama.exe links.
  • GPU detected, llama.exe cli decode coherent: The capital of France is Paris. [ Prompt 150.0 t/s | Generation 142.1 t/s ].
  • test-backend-ops -o FLASH_ATTN_EXT: turbo FA failures 183 -> 0 (now correctly skipped); thousands of f16/f32/q4/q5 FA cases pass.

Known remaining (not fixed here): q8_0 K/V flash attention

test-backend-ops still shows ~17 FLASH_ATTN_EXT(type_K=q8_0,type_V=q8_0) failures (ERR=inf), including basic cases. The pre-catch-up fork's Vulkan build passed q8_0 FA, so this looks like a regression in the merged FA quantized-KV path, but I could not get a same-version clean-upstream Vulkan reference on this box to confirm whether it is merge-introduced or an upstream/AMD-driver limitation surfaced by newer test cases. Normal f16/f32 KV inference is unaffected. Flagging for follow-up.

q8_0 (and other quantized) K/V flash attention produced inf / -FLT_MAX on
Vulkan with the default (f16-accumulation) precision: dequantized KV
magnitudes overflow f16 accumulation at head size 128, corrupting the
softmax. test-backend-ops -o FLASH_ATTN_EXT showed 22 FAIL for
type_K=q8_0,type_V=q8_0 at prec=def while every prec=f32 case passed,
isolating it to the f16 accumulation path. This matches the known Vulkan
quantized-KV-cache + FA corruption reported upstream (incoherent output
with --cache-type-k/v q8_0 on AMD).

Treat quantized KV like BF16 in the f32acc decision in ggml_vk_flash_attn
so the f32-accumulation shader is selected whenever K or V is quantized.

Validated on RX 9070 XT (gfx1201, ROCm/Vulkan KHR_coopmat):
- test-backend-ops -o FLASH_ATTN_EXT: 0 real failures (was 22 q8_0).
- llama-cli decode with -fa on -ctk q8_0 -ctv q8_0: coherent output
  (529/103 t/s), no inf/garbage.
@TheTom

TheTom commented Jun 17, 2026

Owner Author

Fix pushed (f94c3b840): q8_0 K/V flash attention on Vulkan — root-caused and resolved

Follow-up to the q8_0 K/V FA failure flagged earlier. Investigated with deep research + on-device bisection on the RX 9070 XT.

Correction to the earlier note: this is not a catch-up-merge regression. The old pre-catch-up fork's Vulkan test-backend-ops never exercised hsk=128 q8_0 FA (it only tested hsk=64), so its clean q8_0 result was not evidence the merge broke anything. The flash_attn_dequant.glsl q8_0 dequant path is byte-identical to upstream.

Root cause (isolated on-device): at hsk=128, all q8_0 K/V FA failures are prec=def (f16 accumulation); every prec=f32 case passes:

hsk=128 q8_0:  prec=f32 -> 153 OK / 0 FAIL
               prec=def -> 130 OK / 22 FAIL  (inf / -FLT_MAX)

Dequantized q8_0 KV magnitudes overflow f16 accumulation in the flash-attention softmax at head size 128, corrupting the result. This is the same mechanism behind the known upstream Vulkan corruption with --cache-type-k/v q8_0 (e.g. ggml-org#17995), inherited via the upstream FA shaders, and it affects upstream too.
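
The overflow mechanism can be illustrated outside the shader. The sketch below emulates half- and single-precision rounding with Python's `struct` module and accumulates a 128-element dot product both ways. The input magnitudes are invented purely to trigger the effect and are not measured from any model, and the real shader's accumulation order is not reproduced here.

```python
import math
import struct


def to_f16(x: float) -> float:
    """Round x to IEEE half precision; values past the f16 range become +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def to_f32(x: float) -> float:
    """Round x to IEEE single precision."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dot(q, k, rnd):
    """Dot product that rounds every product and every partial sum with rnd."""
    acc = 0.0
    for a, b in zip(q, k):
        acc = rnd(acc + rnd(a * b))
    return acc


HEAD_SIZE = 128
# Illustrative magnitudes only (assumption), chosen so the sum exceeds 65504.
q = [30.0] * HEAD_SIZE
k = [30.0] * HEAD_SIZE

dot_f16 = dot(q, k, to_f16)  # partial sums pass the f16 maximum of 65504
dot_f32 = dot(q, k, to_f32)  # same data, wider accumulator
```

Here `dot_f16` ends up as `inf` while `dot_f32` stays finite at 115200.0, the same shape of failure as the prec=def versus prec=f32 split in the test results.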

Fix: in ggml_vk_flash_attn, force f32 accumulation whenever K or V is quantized (treat quantized KV like BF16 in the f32acc decision). One line.
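
The rule stated above can be written as a small predicate. The actual condition in ggml_vk_flash_attn is not quoted in this thread, so the following Python sketch only models the rule as described; the enum, the `QUANTIZED` set, and the function name are all hypothetical.

```python
from enum import Enum


class KvType(Enum):
    F16 = "f16"
    F32 = "f32"
    BF16 = "bf16"
    Q8_0 = "q8_0"
    Q4_0 = "q4_0"


# Assumption for this sketch: which KV cache types count as quantized.
QUANTIZED = {KvType.Q8_0, KvType.Q4_0}


def use_f32_accumulation(k: KvType, v: KvType, prec_f32: bool) -> bool:
    """Pick f32 accumulation if requested, or if K or V is BF16 or quantized."""
    if prec_f32:
        return True
    if KvType.BF16 in (k, v):
        return True
    # The described fix: quantized K or V is treated the same way as BF16.
    return k in QUANTIZED or v in QUANTIZED
```

With this rule, `use_f32_accumulation(KvType.Q8_0, KvType.Q8_0, prec_f32=False)` selects the f32 path, while plain f16 K/V at default precision keeps f16 accumulation.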

Validated on RX 9070 XT (gfx1201, KHR_coopmat):

  • test-backend-ops -o FLASH_ATTN_EXT: 0 real failures (was 22 q8_0; turbo gated separately).
  • End-to-end decode with -fa on -ctk q8_0 -ctv q8_0: coherent output, no inf/garbage:
    > List three facts about the moon.
    1. The Moon is Earth's only natural satellite...
    2. The Moon has no atmosphere...
    [ Prompt 529.5 t/s | Generation 103.5 t/s ]
    

This is worth upstreaming separately, since plain upstream has the same f16-accumulation overflow for quantized KV flash attention.

@TheTom

TheTom commented Jun 17, 2026

Owner Author

For posterity: Vulkan MUL_MAT crash on AMD is a driver bug, not a code issue

While validating the Vulkan backend, test-backend-ops -o MUL_MAT on the RX 9070 XT (gfx1201, AMD proprietary driver, KHR_coopmat) hard-crashes with an access violation (0xC0000005) at:

MUL_MAT(type_a=f16,type_b=f32,m=16,n=4,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3])

This is a process crash, not a correctness FAIL. Investigated and concluded it is an AMD proprietary Vulkan driver bug, not a llama.cpp / this-fork code issue:

  • Pre-existing: the old pre-catch-up fork build crashes at the identical config, so it is not introduced by the upstream catch-up or any fix in this PR.
  • Not a specific shader path: crashes with GGML_VK_DISABLE_COOPMAT=1 and with GGML_VK_DISABLE_F16=1 (and with both), so it is not the coopmat, f16, or any one matmul shader.
  • Matches an unresolved upstream report, ggml-org/llama.cpp#17432 ("Eval bug: AMD driver 25.11.1 crashes with llamacpp built with Vulkan SDK 1.4.328.1"; AMD proprietary driver + KHR_coopmat crash), where the core Vulkan maintainers could not fix it in code and GGML_VK_DISABLE_COOPMAT=1 did not help either. RADV (Mesa) passes the same MUL_MAT tests (ggml-org/llama.cpp#24521, "Misc. bug: test-backend-ops crashes on some tests"), confirming it is driver-specific.
  • No impact on real inference: llama-cli decode (including -fa on -ctk q8_0 -ctv q8_0) produces coherent output; only the synthetic test shape trips the driver.

Box driver at time of testing: AMD Adrenalin 32.0.23033.1002 (2026-03-08). The only known remedy (per upstream maintainers) is to try a newer/different AMD driver. No repo change applies; recording here so it is not re-investigated as a code regression.

The catch-up added GGML_OP_TURBO_WHT (GGML_OP_COUNT 97 -> 98) but left the
static_assert(GGML_OP_COUNT == 97) in ggml-rpc.h, breaking every
RPC-enabled build (arm64, ubuntu-rpc, macos, windows, ...) with a header
static-assertion failure. Update the assert to 98 and bump
RPC_PROTO_PATCH_VERSION (the op enum is part of the RPC wire protocol).

Verified: ggml-rpc compiles with -DGGML_RPC=ON on M5 (Metal).
@TheTom TheTom merged commit 35ac80d into feature/turboquant-kv-cache Jun 17, 2026
41 of 70 checks passed