Dsv4 metal #3

Open
tarruda wants to merge 3 commits into fairydreaming:dsv4 from tarruda:dsv4-metal

tarruda commented Jul 3, 2026

This was mostly vibe coded by DeepSeek V4 Pro. The last commit is Codex fixing a Metal bug in the lightning indexer op that the earlier commits introduced (plus a backend test for that op).

On my M1 Ultra, this increases DSV4-flash token generation speed from ~6 to ~20 tokens/s. I've been running it with an IQ3_XXS quant.

From my testing, it appears to work well. Unlike the upstream llama.cpp implementation, it doesn't seem to have this bug, or at least I couldn't reproduce it after a pi session that used more than 100k tokens of context.

I don't know whether you have restrictions on AI-coded changes or whether you are interested in merging this. But since the dsv4 branch doesn't seem to be meant for a llama.cpp PR, I thought it would be good to centralize the implementations for more backends here, so they can serve as a reference for future llama.cpp PRs.

tarruda added 3 commits July 3, 2026 15:36
Implements the DeepSeek V4 lightning indexer on Metal GPU. Follows
the CUDA vec kernel approach with 8 SIMD groups per threadgroup,
each processing 8 KV vectors using simd_sum for per-head dot product
reduction. Supports F32, F16, BF16 and quantized K types (Q4_0,
Q4_1, Q5_0, Q5_1, Q8_0).
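
As a rough illustration of the reduction described above, the sketch below scores KV vectors against per-head queries in plain Python. It assumes the indexer score has the DeepSeek sparse-attention form, a per-head weight times ReLU(q_h . k) summed over heads. That formula, the 32-lane SIMD width, and every name in the block are assumptions for illustration and are not taken from the kernels in this PR.

```python
SIMD_WIDTH = 32  # assumed number of lanes per SIMD group (hypothetical)

def simd_sum(lane_values):
    """Stand-in for a SIMD-group sum: adds one partial value per lane."""
    return sum(lane_values)

def lane_dot(q, k, width=SIMD_WIDTH):
    """Dot product where lane i accumulates elements i, i+width, i+2*width, ..."""
    partials = [0.0] * width
    for i, (a, b) in enumerate(zip(q, k)):
        partials[i % width] += a * b
    return simd_sum(partials)

def indexer_scores(q_heads, head_weights, kv):
    """Score every KV vector as sum over heads of w_h * relu(q_h . k) (assumed form)."""
    scores = []
    for k in kv:
        s = 0.0
        for q, w in zip(q_heads, head_weights):
            s += w * max(lane_dot(q, k), 0.0)
        scores.append(s)
    return scores

# Two heads, two KV vectors, embedding width 4 (toy sizes).
q_heads = [[1.0, 0.0, 2.0, 0.0], [0.0, -1.0, 0.0, 1.0]]
head_weights = [0.5, 2.0]
kv = [[1.0, 1.0, 1.0, 1.0], [0.0, -2.0, 0.0, 1.0]]
scores = indexer_scores(q_heads, head_weights, kv)
```

In a real kernel each lane would run in parallel and quantized K blocks would be dequantized before the multiply; the sketch only shows the arithmetic that the per-head reduction has to reproduce.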

Assisted-by: DeepSeek V4 Pro

Implements GGML_OP_DSV4_HC_COMB, GGML_OP_DSV4_HC_PRE, and
GGML_OP_DSV4_HC_POST on Metal GPU. HC_PRE performs weighted sum
over hc slices, HC_COMB computes Sinkhorn-normalized combination
matrices, and HC_POST blends input with residuals using the
combination weights. All kernels operate on F32 tensors.
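
To make the two simpler pieces concrete, here is a minimal Python sketch of Sinkhorn normalization and of a weighted sum over hc slices. Sinkhorn normalization alternately rescales the rows and columns of a positive matrix until both sum to roughly 1. Exponentiating the logits, the square shape, the iteration count, and the function names are assumptions for this sketch; the actual HC_COMB and HC_PRE kernels may differ, and HC_POST is not covered.

```python
import math

def sinkhorn(logits, n_iters=20):
    """Alternately normalize rows and columns of exp(logits) (square matrix)."""
    m = [[math.exp(v) for v in row] for row in logits]
    n = len(m)
    for _ in range(n_iters):
        # Row step: make every row sum to 1.
        m = [[v / sum(row) for v in row] for row in m]
        # Column step: make every column sum to 1.
        col = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col[j] for j in range(n)] for i in range(n)]
    return m

def hc_pre(slices, weights):
    """Weighted sum over hc slices: out[d] = sum_h weights[h] * slices[h][d]."""
    dim = len(slices[0])
    return [sum(w * s[d] for w, s in zip(weights, slices)) for d in range(dim)]

comb = sinkhorn([[0.0, 1.0], [2.0, 0.5]])
mixed = hc_pre([[1.0, 2.0], [3.0, 4.0]], [0.25, 0.75])
```

After enough iterations `comb` is close to doubly stochastic, which is the property a Sinkhorn-normalized combination matrix is meant to have.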

Assisted-by: DeepSeek V4 Pro

The Metal lightning indexer assigned a scalar float expression to the float4
threadgroup q tile and derived the address from the packed embedding width.
That ignored the q head stride and loaded the wrong q vector data, corrupting
indexer scores for DeepSeek V4 on Metal.

Load q tiles as strided float4 values, matching the CUDA path, and use the
provided source and destination strides in HC_PRE instead of assuming
contiguous row layout.
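
The addressing problem can be reproduced in miniature. In the Python sketch below, each head occupies more elements in memory than its embedding width, so computing offsets from the packed width reads the wrong data for every head after the first. The buffer layout, sizes, and function names are hypothetical and only illustrate the class of bug.

```python
HEAD_DIM = 4     # elements actually used per head (hypothetical)
HEAD_STRIDE = 6  # elements between the starts of consecutive heads (hypothetical)
N_HEADS = 3

# Flat buffer: each head holds HEAD_DIM real values followed by padding (-1.0).
buf = []
for h in range(N_HEADS):
    buf += [float(10 * h + i) for i in range(HEAD_DIM)]
    buf += [-1.0] * (HEAD_STRIDE - HEAD_DIM)

def load_head_packed(buf, h):
    """Buggy: assumes heads are stored back to back with no padding."""
    start = h * HEAD_DIM
    return buf[start:start + HEAD_DIM]

def load_head_strided(buf, h):
    """Correct: steps by the head stride that comes with the tensor."""
    start = h * HEAD_STRIDE
    return buf[start:start + HEAD_DIM]

wrong = load_head_packed(buf, 1)
right = load_head_strided(buf, 1)
```

Both loaders agree for head 0, which is why this kind of bug can pass a test that only exercises a single head or a contiguous layout.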

Add a LIGHTNING_INDEXER backend-op case so the DSV4-shaped F32 path is checked
against CPU.
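
A check of this kind can be sketched as a normalized mean squared error (NMSE) between the CPU reference and the backend output. Whether the new test case uses exactly this metric, and the threshold shown here, are assumptions; the data values are made up for illustration.

```python
def nmse(reference, candidate):
    """Normalized mean squared error: sum((c - r)^2) / sum(r^2)."""
    err = sum((c - r) ** 2 for r, c in zip(reference, candidate))
    norm = sum(r ** 2 for r in reference)
    return err / norm

MAX_NMSE = 1e-7  # assumed tolerance for this sketch

cpu_scores = [1.5, 6.0, 0.25, 3.0]
good_backend = [1.5, 6.0, 0.25, 3.0000001]  # tiny rounding difference
bad_backend = [1.5, 0.0, 0.25, 3.0]         # one corrupted score

good_ok = nmse(cpu_scores, good_backend) <= MAX_NMSE
bad_ok = nmse(cpu_scores, bad_backend) <= MAX_NMSE
```

A relative metric tolerates small floating-point differences between backends while still flagging a corrupted score like the one the strided-load bug would produce.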

Assisted-by: Codex