Dsv4 metal #3
Open
tarruda wants to merge 3 commits into
Implements the DeepSeek V4 lightning indexer on Metal GPU. Follows the CUDA vec kernel approach with 8 SIMD groups per threadgroup, each processing 8 KV vectors using simd_sum for per-head dot product reduction. Supports F32, F16, BF16 and quantized K types (Q4_0, Q4_1, Q5_0, Q5_1, Q8_0). Assisted-by: DeepSeek V4 Pro
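As a rough illustration of the reduction described in this commit, the sketch below emulates one SIMD group in plain Python: each of 32 lanes accumulates a slice of a per-head dot product, and a stand-in for Metal's `simd_sum` combines the lane partials. The scoring rule (a weighted sum of ReLU'd per-head dot products) and the names `head_dot`, `indexer_score`, and `head_weights` are assumptions for illustration; the commit message does not spell out the formula, and this is not the kernel's code.

```python
SIMD_WIDTH = 32  # lanes per SIMD group on Apple GPUs


def simd_sum(lane_values):
    """Stand-in for Metal's simd_sum: the sum over all lanes of the group."""
    return sum(lane_values)


def head_dot(q_head, k_vec):
    """Dot product of one q head with one K vector, split across lanes."""
    partials = [0.0] * SIMD_WIDTH
    for i, (q, k) in enumerate(zip(q_head, k_vec)):
        # Lane (i % SIMD_WIDTH) owns element i, mimicking a strided lane loop.
        partials[i % SIMD_WIDTH] += q * k
    return simd_sum(partials)


def indexer_score(q_heads, head_weights, k_vec):
    """Assumed scoring rule: sum over heads of weight * ReLU(q_head . k)."""
    score = 0.0
    for q_head, w in zip(q_heads, head_weights):
        score += w * max(head_dot(q_head, k_vec), 0.0)
    return score
```

In the kernel described above, each of the 8 SIMD groups in a threadgroup would run this kind of reduction for 8 KV vectors; the Python version only shows the arithmetic for a single KV vector.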
Implements GGML_OP_DSV4_HC_COMB, GGML_OP_DSV4_HC_PRE, and GGML_OP_DSV4_HC_POST on Metal GPU. HC_PRE performs weighted sum over hc slices, HC_COMB computes Sinkhorn-normalized combination matrices, and HC_POST blends input with residuals using the combination weights. All kernels operate on F32 tensors. Assisted-by: DeepSeek V4 Pro
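Two of the operations named in this commit can be sketched in a few lines of Python. `hc_pre` is the weighted sum over hc slices, and `sinkhorn` shows the general Sinkhorn idea of alternately normalizing rows and columns until the matrix is approximately doubly stochastic. The exponentiation step, the iteration count, and the function names are assumptions for illustration; the commit message does not say how HC_COMB parameterizes its normalization.

```python
import math


def hc_pre(slices, weights):
    """Weighted sum over hc slices: out[d] = sum_h weights[h] * slices[h][d]."""
    dim = len(slices[0])
    return [sum(w * s[d] for s, w in zip(slices, weights)) for d in range(dim)]


def sinkhorn(logits, iters=50):
    """Alternate row and column normalization of exp(logits) (assumed form)."""
    m = [[math.exp(x) for x in row] for row in logits]
    for _ in range(iters):
        # Make every row sum to 1.
        m = [[x / sum(row) for x in row] for row in m]
        # Make every column sum to 1.
        col_sums = [sum(col) for col in zip(*m)]
        m = [[x / col_sums[j] for j, x in enumerate(row)] for row in m]
    return m
```

After enough iterations, both the row sums and the column sums of the result are close to 1, which is what makes the matrix usable as a set of combination weights.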
The Metal lightning indexer assigned a scalar float expression to the float4 threadgroup q tile and derived the address from the packed embedding width. That ignored the q head stride and loaded the wrong q vector data, corrupting indexer scores for DeepSeek V4 on Metal. Load q tiles as strided float4 values, matching the CUDA path, and use the provided source and destination strides in HC_PRE instead of assuming contiguous row layout. Add a LIGHTNING_INDEXER backend-op case so the DSV4-shaped F32 path is checked against CPU. Assisted-by: Codex
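The addressing bug described in this commit can be reproduced in miniature. The sketch below stores two q heads in a flat buffer with padding between them, so the head stride is larger than the head width. Deriving the offset from the packed width reads the wrong elements, while using the stride reads the intended head. The buffer layout and the function names are hypothetical and only illustrate the indexing difference.

```python
def load_q_head_packed(buf, head, head_dim):
    """Buggy addressing: assumes heads are packed back to back."""
    start = head * head_dim
    return buf[start:start + head_dim]


def load_q_head_strided(buf, head, head_dim, head_stride):
    """Fixed addressing: uses the stride supplied with the tensor."""
    start = head * head_stride
    return buf[start:start + head_dim]


# Two heads of width 4, stored with a stride of 6 (two padding slots each).
PAD = -1.0
q_buf = [0.0, 1.0, 2.0, 3.0, PAD, PAD,
         10.0, 11.0, 12.0, 13.0, PAD, PAD]

wrong = load_q_head_packed(q_buf, head=1, head_dim=4)
right = load_q_head_strided(q_buf, head=1, head_dim=4, head_stride=6)
```

Both functions agree for head 0 and whenever the stride equals the head width, which is why a test on a packed tensor alone would not catch the problem.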
This was mostly vibe coded by DeepSeek V4 Pro. The last commit is Codex fixing a Metal bug in the lightning indexer op that was introduced in the previous commits, plus a backend test for that op.
On my M1 Ultra, this increases DSV4-flash token generation speed from ~6 tps to ~20 tps. I've been running this with this IQ3_XXS quant.
From my testing, this looks like it is working well. Unlike the llama.cpp upstream implementation, it doesn't seem to have this bug, or at least I couldn't reproduce it after a pi session that used more than 100k tokens of context.
I don't know whether you have restrictions against AI-coded changes or whether you are interested in merging. But since the dsv4 branch doesn't seem to be meant for a llama.cpp PR, I thought it would be good for it to centralize the implementations for more backends, which can then serve as a reference for future llama.cpp PRs.