Dsv4 metal by tarruda · Pull Request #3 · fairydreaming/llama.cpp

tarruda · 2026-07-03T20:46:39Z

This was mostly vibe coded by DeepSeek V4 Pro. The last commit is Codex fixing a Metal bug in the lightning indexer op that the previous commits introduced (plus a backend test for that op).

On my M1 Ultra, this increases DSV4-flash token generation speed from ~6 t/s to ~20 t/s. I've been running it with this IQ3_XXS quant.

From my testing it seems to work well. Unlike the upstream llama.cpp implementation, it doesn't seem to have this bug, or at least I couldn't reproduce it after a pi session that used more than 100k tokens of context.

I don't know if you have restrictions against AI-coded changes or whether you are interested in merging. But since the dsv4 branch doesn't seem to be meant for a llama.cpp PR, I thought it would be good for it to centralize the implementations for more backends, which can then serve as a reference for future llama.cpp PRs.

Implements the DeepSeek V4 lightning indexer on Metal GPU. Follows the CUDA vec kernel approach with 8 SIMD groups per threadgroup, each processing 8 KV vectors using simd_sum for per-head dot product reduction. Supports F32, F16, BF16 and quantized K types (Q4_0, Q4_1, Q5_0, Q5_1, Q8_0). Assisted-by: DeepSeek V4 Pro
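As a rough CPU-side illustration of what the indexer kernel reduces, here is a minimal sketch. It assumes the scoring follows the published DeepSeek-V3.2 lightning indexer form (a per-head dot product, ReLU, then a weighted sum over heads); whether DSV4 uses exactly this form is an assumption, and `indexer_scores` and its parameters are hypothetical names, not ggml APIs.

```python
def indexer_scores(q_heads, head_weights, keys):
    """Score every KV vector against one query token.

    q_heads:      list of per-head query vectors (n_head x n_embd)
    head_weights: one scalar weight per head (assumed, as in DeepSeek-V3.2)
    keys:         list of indexer key vectors (n_kv x n_embd)
    """
    scores = []
    for k in keys:
        total = 0.0
        for q_h, w_h in zip(q_heads, head_weights):
            # Per-head dot product: the part the Metal kernel reduces with simd_sum.
            dot = sum(a * b for a, b in zip(q_h, k))
            # ReLU followed by a weighted sum over heads (assumed formulation).
            total += w_h * max(dot, 0.0)
        scores.append(total)
    return scores
```

In the Metal kernel each SIMD group would handle several KV vectors in parallel; the sketch keeps the loops sequential for clarity.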

Implements GGML_OP_DSV4_HC_COMB, GGML_OP_DSV4_HC_PRE, and GGML_OP_DSV4_HC_POST on Metal GPU. HC_PRE performs weighted sum over hc slices, HC_COMB computes Sinkhorn-normalized combination matrices, and HC_POST blends input with residuals using the combination weights. All kernels operate on F32 tensors. Assisted-by: DeepSeek V4 Pro
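For readers unfamiliar with the terms, a minimal sketch of the two building blocks named above: Sinkhorn normalization (alternating row and column normalization of a positive matrix) and a weighted sum over slices. The exponentiation of the inputs, the iteration count, and the function names are assumptions for illustration, not taken from the kernels in this PR.

```python
import math

def sinkhorn(logits, n_iters=20):
    """Alternately normalize rows and columns so both approach sum 1."""
    m = [[math.exp(v) for v in row] for row in logits]  # make entries positive (assumed)
    n_cols = len(m[0])
    for _ in range(n_iters):
        # Row normalization: each row sums to 1.
        m = [[v / sum(row) for v in row] for row in m]
        # Column normalization: each column sums to 1.
        col_sums = [sum(row[j] for row in m) for j in range(n_cols)]
        m = [[v / col_sums[j] for j, v in enumerate(row)] for row in m]
    return m

def weighted_sum(slices, weights):
    """HC_PRE-style reduction: out[i] = sum_s weights[s] * slices[s][i]."""
    n = len(slices[0])
    return [sum(w * s[i] for w, s in zip(weights, slices)) for i in range(n)]
```

For a square matrix the result converges toward a doubly stochastic matrix, which is what makes it usable as a set of combination weights.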

The Metal lightning indexer assigned a scalar float expression to the float4 threadgroup q tile and derived the address from the packed embedding width. That ignored the q head stride and loaded the wrong q vector data, corrupting indexer scores for DeepSeek V4 on Metal. Load q tiles as strided float4 values, matching the CUDA path, and use the provided source and destination strides in HC_PRE instead of assuming contiguous row layout. Add a LIGHTNING_INDEXER backend-op case so the DSV4-shaped F32 path is checked against CPU. Assisted-by: Codex
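The addressing part of this fix can be illustrated with a small sketch. It models the q buffer as a flat list in which each head's row may be followed by padding, so the head stride is larger than the embedding width. The function names are hypothetical, and strides here are counted in elements, whereas ggml strides are in bytes.

```python
def load_q_head_packed(buf, head, n_embd):
    """Buggy pattern: derives the offset from the packed embedding width."""
    start = head * n_embd
    return buf[start:start + n_embd]

def load_q_head_strided(buf, head, n_embd, head_stride):
    """Fixed pattern: uses the tensor's head stride to find the row start."""
    start = head * head_stride
    return buf[start:start + n_embd]

# Two heads of width 4, each padded to a stride of 6 elements.
PAD = -1.0
q_buf = [0.0, 1.0, 2.0, 3.0, PAD, PAD,
         10.0, 11.0, 12.0, 13.0, PAD, PAD]
```

With this layout the packed version reads padding and part of the wrong row for head 1, while the strided version returns the intended vector. The two agree only when the tensor is contiguous.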

tarruda added 3 commits July 3, 2026 15:36

github-actions (bot) added the ggml, Apple Metal, and testing labels on Jul 3, 2026

tarruda wants to merge 3 commits into fairydreaming:dsv4 from tarruda:dsv4-metal
