Catch up to upstream/master (preserve all TurboQuant+ / MTP / contributor work) #182
* ci : disable all CPU variant builds for Vulkan workflow * cont : change cache key * cont : change build type
* mtmd: fix gemma 4 audio rms norm eps * Update tools/mtmd/clip.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
removed AI-generated comment
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* ci : releases use Github-hosted builds for the UI * cont : fix name
When model props are fetched asynchronously from the server, modelPropsVersion is incremented to trigger reactivity, but only the vision effect was listening to it.
* run ui publish on self-hosted fast * run on ubuntu-slim
* mtmd-debug: add color and rainbow mode * fix M_PI * max_dist
…l-org#23835) Updating infra to enable op fusion and using RMS_NORM+MUL as the use-case.
…gml-org#23480) Without this at least the vulkan backend will skip the `* 0` for !COMPUTE tensors, causing corrupt output.
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* llama: add llm_graph_input_mtp * rename input_mtp -> input_token_embd * add TODO about mtmd embedding * cont : clean-up --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
[no release] Signed-off-by: Omid Azizi <oazizi@gimletlabs.ai>
* llama: use f16 mask for FA * review: add llama_cast + formatting * simplify
…se Attention (DSA) implementation (ggml-org#23346)

* llama : support DeepSeek V3.2 model family (with DSA lightning indexer)
* convert : handle DeepseekV32ForCausalLM architecture
* ggml : support for f16 GGML_OP_FILL
* memory : separate hparams argument in llama_kv_cache constructor
* memory : add llama_kv_cache_dsa memory (KV cache + lightning indexer cache)
* llama : support for LLM_ARCH_DEEPSEEK32
* model : llama_model_deepseek32 implementation
* model : merge two scale operations into one in DSA lightning indexer implementation
* chore : remove unused code
* model : support NVFP4 in DeepSeek V3.2

  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* memory : refactoring TODO

  Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>
* server: bump timeout to 3600s * nits: change wording
…23530)

* CUDA: Check PTX version on host side to guard PDL dispatch

  Checking on `__CUDA_ARCH_LIST__` alone is insufficient for JIT, as this variable doesn't differentiate between compiling for, say, sm_90, sm_90a or sm_90f (so forward-jittable PTX vs. arch/family-specific PTX). Thus, one can have a bug when compiling with `-DCMAKE_CUDA_ARCHITECTURES="89;90a"`, where the current code would wrongly dispatch to PDL on sm_90/sm_120 in forward-JIT mode.

  This PR fixes this issue by checking `cudaFuncAttributes::ptxVersion` of the incoming kernel at runtime. A check on ptxVersion alone is sufficient, as device-codes will always be >= ptxVersion (and any violation of this would be a severe bug in CUDA/nvcc), see: https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/#gpu-code-code-code

* Implement MurmurHash3 mixer for better hash distribution

  Magic constants were taken from boost: https://github.com/boostorg/container_hash/blob/2698b43803c012601e6bb1a6116e83767b97986c/include/boost/container_hash/detail/hash_mix.hpp#L19-L65

* Update ggml/src/ggml-cuda/common.cuh

  Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Address review comments, make seed non-zero
* Apply code-formatting
* Replace std::size_t -> size_t for consistency

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
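For readers unfamiliar with the mixer mentioned in the commit above, here is a minimal sketch of the textbook MurmurHash3 64-bit finalizer. It uses the public-domain MurmurHash3 constants, not the boost constants the commit says it took, and `hash_combine` and its offset are hypothetical names chosen for illustration. It also shows why a non-zero seed matters: the finalizer maps 0 to 0.

```cpp
#include <cassert>
#include <cstdint>

// MurmurHash3 64-bit finalizer ("fmix64"): alternating xor-shift and multiply
// steps that spread every input bit across the whole word.
// NOTE: these are the public-domain MurmurHash3 constants, shown for
// illustration only; the commit states its constants came from boost.
inline uint64_t fmix64(uint64_t h) {
    h ^= h >> 33;
    h *= 0xff51afd7ed558ccdULL;
    h ^= h >> 33;
    h *= 0xc4ceb9fe1a85ec53ULL;
    h ^= h >> 33;
    return h;
}

// Hypothetical helper: fold a value into a running hash. The non-zero offset
// keeps the (seed = 0, value = 0) case from reaching fmix64(0), which is 0.
inline uint64_t hash_combine(uint64_t seed, uint64_t value) {
    return fmix64(seed + 0x9e3779b97f4a7c15ULL + value);
}
```

Each step of `fmix64` is invertible, so distinct inputs always give distinct outputs; the mixing only improves how those outputs are distributed across hash buckets.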
* mtmd: DeepSeek-OCR 2 support, with multi-tile dynamic resolution
* introduced clip_image_f32::add_viewsep
* address PR review
  - drop redundant ggml_cpy ops in both deepseekocr versions build
  - drop no-op ggml_cont in build_sam
  - assert num_image_tokens deepseekocr2
  - view_seperator as (1, n_embd) at conversion (for both versions)
  - drop redundant ggml_reshape_2d
* Update tools/mtmd/models/deepseekocr2.cpp

  Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
* download: add option to skip_download * fix * fix 2 * if file doesn't exist, respect skip_download flag
Verification: M5 Max (Metal)
EAGLE3 (ggml-org#18039) status
No cherry-pick required. The upstream EAGLE3 squash merge (
Build
Full build green, AppleClang 21,
Decode smoke (Metal, GPU offload)
Coherent output, clean exit. Metal backend confirmed working with the merged EAGLE3 code. Vulkan and HIP/ROCm results to follow from the RDNA4 box.
Verification: AMD RDNA4 (RX 9070 XT, gfx1201, ROCm 7.1), Vulkan + HIP
Built from the same PR head (
HIP / ROCm: green
Vulkan: blocked at shader-gen (the deferred turbo3 FA item)
Summary
CUDA results (RTX 4090, sm_89, CUDA 13.0, WSL2). Built green off
Turbo KV A/B, Llama-3.1-8B-Instruct-Q4_K_M,
All four give consistent, coherent output; turbo4/3/2 track f16 (same topic and quality, no degeneration). Decode is within ~3-4% of f16 (turbo2 ~= f16); prefill is actually faster on turbo. Attn-rotation left default-off. This is a short-context smoke test (37-tok prompt / 160 gen), so it checks correctness and no-regression rather than the long-context memory win. Happy to run long-context or other archs if useful. CUDA box checked.
Merged into a fresh compile on CUDA (4x NVIDIA P100s). Qwen 3.6 35B MoE MTP Heretic Q5. KV Q8/Turbo4
Previous (PR 172 + Row Splitting):
New (PR 182 + Layer Splitting):
Will test ROCm on RDNA4 next.
ggml_backend_cuda_device_supports_op reported GGML_OP_GET_ROWS as supported for TQ4_1S and TQ3_1S, but the get_rows src0-type switch had no cases for them, so any get_rows on a TurboQuant weight hit the default GGML_ABORT (test-backend-ops crashed at the tq3_1s case with 0xC0000409 on RDNA4/ROCm 7.1).

Add the two cases using the existing per-pair dequant kernels (dequantize_tq4_1s / dequantize_tq3_1s, QR=1 -> 2 consecutive elements), mirroring how convert.cu wires the same dequant functions.

Validated on RX 9070 XT (gfx1201, ROCm 7.1): test-backend-ops -o GET_ROWS now passes all tq3_1s and tq4_1s cases against the CPU reference (Backend ROCm0: OK).
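The bug class described above (a capability query that promises more than the dispatch switch delivers) can be sketched in isolation. The following is not the fix that was pushed, which simply added the two missing cases; it is a hypothetical arrangement in which the capability query is derived from the same lookup as the dispatch, so the two cannot drift apart. `QType`, `dequant_stub`, and the function names are illustrative stand-ins, not ggml APIs.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for ggml tensor types; not the real ggml enum.
enum class QType { F16, Q8_0, TQ4_1S, TQ3_1S, COUNT };

using dequant_fn = void (*)(const void * src, float * dst, std::size_t n);

// Placeholder kernel: a real backend would launch its dequant kernel here.
inline void dequant_stub(const void *, float * dst, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] = 0.0f;
    }
}

// Single lookup shared by the capability query and the dispatch.
inline dequant_fn get_rows_kernel(QType t) {
    switch (t) {
        case QType::F16:    return dequant_stub;
        case QType::Q8_0:   return dequant_stub;
        case QType::TQ4_1S: return dequant_stub;
        case QType::TQ3_1S: return dequant_stub;
        default:            return nullptr;
    }
}

// A type is reported as supported only if a kernel is actually wired in.
inline bool supports_get_rows(QType t) {
    return get_rows_kernel(t) != nullptr;
}

// Returns false instead of aborting when no kernel exists for the type.
inline bool get_rows(QType t, const void * src, float * dst, std::size_t n) {
    const dequant_fn fn = get_rows_kernel(t);
    if (fn == nullptr) {
        return false;
    }
    fn(src, dst, n);
    return true;
}
```

With this shape, forgetting a case makes the op report as unsupported, so a scheduler can route around it rather than hitting an abort at run time.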
Fix pushed (
Confirmed
Before (
After (
Same reproduce-then-fix you saw on RDNA4/HIP, holds on CUDA.
The upstream catch-up left the Vulkan flash-attention shaders unbuildable: flash_attn_dequant.glsl and flash_attn_cm1.comp had orphan #endif directives (the merge swapped the fork's turbo FA dequant for upstream's bf16 path but left dangling #endif from the old #if defined(DATA_A_TURBO3_0) blocks). glslc then failed/hung on every FA shader permutation, leaving empty generated .cpp files and an unresolved ggml-vulkan.dll link, so the whole Vulkan backend could not build.

- Remove the 2 orphan #endif in each of flash_attn_dequant.glsl and flash_attn_cm1.comp to rebalance the preprocessor.
- Stop generating the turbo3 FA SPIR-V variant in vulkan-shaders-gen (its dequant path was dropped by the merge; the variant is not wired into the runtime and glslc hangs on it).
- Report TURBO2/3/4 K/V as unsupported for FLASH_ATTN_EXT in the Vulkan supports_op so they are skipped rather than dispatched to a missing pipeline (was producing inf -> test-backend-ops FAIL).

Validated on RX 9070 XT (gfx1201): Vulkan build completes, GPU detected, llama-cli decode coherent (150/142 t/s), and test-backend-ops -o FLASH_ATTN_EXT has 0 turbo failures (now skipped).

Known remaining: q8_0 K/V FA fails on this build (separate, pre-existing/merge FA-quant issue, under investigation).
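Orphan #endif directives like the ones described above are cheap to catch before a merge reaches glslc. The following is a minimal sketch of such a check, assuming one directive per line and no directives hidden inside block comments; `check_preprocessor_balance` is a hypothetical helper, not part of the repository.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>

// Returns 0 if #if/#ifdef/#ifndef and #endif are balanced, the 1-based line
// number of the first orphan #endif, or -1 if a conditional is never closed.
// Simplification: block comments and line continuations are not handled.
inline int check_preprocessor_balance(const std::string & source) {
    std::istringstream in(source);
    std::string line;
    int depth   = 0;
    int line_no = 0;
    while (std::getline(in, line)) {
        ++line_no;
        const std::size_t hash = line.find_first_not_of(" \t");
        if (hash == std::string::npos || line[hash] != '#') {
            continue; // not a preprocessor line
        }
        const std::size_t start = line.find_first_not_of(" \t", hash + 1);
        if (start == std::string::npos) {
            continue; // lone '#'
        }
        const std::size_t end = line.find_first_of(" \t(/!\r", start);
        const std::string word = line.substr(
            start, end == std::string::npos ? std::string::npos : end - start);
        if (word == "if" || word == "ifdef" || word == "ifndef") {
            ++depth;
        } else if (word == "endif") {
            if (depth == 0) {
                return line_no; // orphan #endif
            }
            --depth;
        }
    }
    return depth == 0 ? 0 : -1;
}
```

Running a check of this kind over each .glsl and .comp file after a merge would flag the dangling directive by line number instead of leaving it to surface as a shader-gen hang.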
Fix pushed (
q8_0 (and other quantized) K/V flash attention produced inf / -FLT_MAX on Vulkan with the default (f16-accumulation) precision: dequantized KV magnitudes overflow f16 accumulation at head size 128, corrupting the softmax. test-backend-ops -o FLASH_ATTN_EXT showed 22 FAIL for type_K=q8_0,type_V=q8_0 at prec=def while every prec=f32 case passed, isolating it to the f16 accumulation path. This matches the known Vulkan quantized-KV-cache + FA corruption reported upstream (incoherent output with --cache-type-k/v q8_0 on AMD).

Treat quantized KV like BF16 in the f32acc decision in ggml_vk_flash_attn so the f32-accumulation shader is selected whenever K or V is quantized.

Validated on RX 9070 XT (gfx1201, ROCm/Vulkan KHR_coopmat):
- test-backend-ops -o FLASH_ATTN_EXT: 0 real failures (was 22 q8_0).
- llama-cli decode with -fa on -ctk q8_0 -ctv q8_0: coherent output (529/103 t/s), no inf/garbage.
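The accumulator choice described above can be sketched as a small predicate. This is an illustration only: the real decision lives in ggml_vk_flash_attn and uses ggml's own enums, whereas `KVType`, `Prec`, `use_f32_accumulation`, and `overflows_f16` are hypothetical names, and the magnitudes in the overflow helper are made up to show the arithmetic rather than measured from the failing cases.

```cpp
#include <cassert>

// Hypothetical stand-ins for the K/V tensor types and the precision hint.
enum class KVType { F16, BF16, Q8_0, Q4_0 };
enum class Prec   { Default, F32 };

// Largest finite IEEE 754 half-precision value.
constexpr float F16_MAX = 65504.0f;

inline bool is_quantized(KVType t) {
    return t == KVType::Q8_0 || t == KVType::Q4_0;
}

// Select f32 accumulation when the caller asks for it, when K or V is BF16,
// or (the change described above) when K or V is quantized.
inline bool use_f32_accumulation(KVType k, KVType v, Prec prec) {
    if (prec == Prec::F32) {
        return true;
    }
    if (k == KVType::BF16 || v == KVType::BF16) {
        return true;
    }
    return is_quantized(k) || is_quantized(v);
}

// True if summing n terms of roughly this magnitude would exceed half range.
inline bool overflows_f16(float term, int n) {
    return term * static_cast<float>(n) > F16_MAX;
}
```

As a rough illustration, 128 products of magnitude 600 sum to 76800, which is above 65504, so a head size of 128 leaves little headroom in a half-precision accumulator once individual terms grow into the hundreds.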
Fix pushed (
For posterity: Vulkan
The catch-up added GGML_OP_TURBO_WHT (GGML_OP_COUNT 97 -> 98) but left the static_assert(GGML_OP_COUNT == 97) in ggml-rpc.h, breaking every RPC-enabled build (arm64, ubuntu-rpc, macos, windows, ...) with a header static-assertion failure. Update the assert to 98 and bump RPC_PROTO_PATCH_VERSION (the op enum is part of the RPC wire protocol). Verified: ggml-rpc compiles with -DGGML_RPC=ON on M5 (Metal).
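The guard that tripped here is a common pattern: pin the enum size with a static_assert so that adding an op forces a deliberate protocol review. A miniature sketch follows; the `mini_op` enum and `MINI_PROTO_PATCH_VERSION` are hypothetical and only mirror the shape of the check in ggml-rpc.h.

```cpp
#include <cassert>

// Hypothetical miniature of an op enum whose values are part of a wire format.
enum mini_op {
    MINI_OP_ADD,
    MINI_OP_MUL,
    MINI_OP_TURBO_WHT, // newly added op: moves MINI_OP_COUNT from 2 to 3
    MINI_OP_COUNT
};

// Adding or removing an op changes MINI_OP_COUNT and breaks the build here,
// which prompts whoever changed the enum to update the protocol version too.
static_assert(MINI_OP_COUNT == 3,
              "op enum changed: update this assert and bump the protocol patch version");

// Bumped together with the assert above (value is illustrative).
constexpr int MINI_PROTO_PATCH_VERSION = 1;
```

The assert cannot tell whether the version was actually bumped; it only guarantees that an enum change is noticed at compile time rather than as a silent mismatch between peers.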
Full catch-up merge of `upstream/master` into `feature/turboquant-kv-cache`. Merge (not rebase), so all fork commits, hashes, and authors are retained.
Verified on M5 Max (Metal)
Preserved (verified)
Key resolutions
See `TURBOQUANT_UPSTREAM_MERGE.md`. Highlights: kv-cache ctor union (upstream shared-cells + fork auto-asymmetric + n_layer_kv); attn-rotation keeps fork policy + grafts upstream's DeepSeek-V3.2 indexer force; `fattn-common.cuh` takes upstream's graph-safe `f16_extra` (supersedes HIP pool workaround).
Build fixes (auto-merge dual-additions / API drift)
GGML_OP_COUNT 97->98 (TURBO_WHT); duplicate `n_outputs_max`; missing `n_embd` decl; duplicate `get_suppress_tokens` / `PROJECTOR_TYPE_GEMMA4UA` / `clip_graph_gemma4uv`; server `get_available_slot` reconciled.
TODO before relying on this (per-backend, untestable on M5)