spec: add DSpark speculative decoding #25173
|
Hi @CISC @ggerganov , this adds DSpark speculative decoding on top of the merged DFlash drafter (#22105). It's a small change — a new dspark draft arch and draft-dspark spec type that reuse DFlash's graph, feature extraction, KV-cache injection and verify path unchanged; the only new logic is the semi-autoregressive Markov head in draft(). Greedy stays lossless. I benchmarked it against the merged DFlash using DeepSeek's released Qwen3 DSpark drafts. On Qwen3-8B at bf16 / Q8_0 / Q4_K_M, DSpark beats DFlash on every domain (e.g. GSM8K bf16 4.06× vs 3.12×; full per-domain tables in the PR description). I believe it's ready for review and I'm happy to walk through any part of it. |
|
Can you run SpeedBench to do the full comparison between DFlash and DSpark with the same |
|
@ruixiang63 I've run the SpeedBench test set as you suggested, and updated the results in the PR description. DSpark does outperform DFlash across the board. |
aae2941 to d8b38f2
|
could you give some examples how to use? |
Good point — I've updated the PR description with a more detailed, copy-pasteable end-to-end example (download → convert → build → run → curl). Let me know if anything's unclear. |
|
DSV4 support was merged in #24162, ideally this PR should cover that model as well and try to replicate a similar speedup |
Thanks! DeepSeek hasn't open-sourced the DSpark weights for DeepSeek-V4 though — only the Qwen3 and Gemma4 drafts are released. So this PR covers Qwen3 for now, and I'll add Gemma4 as a small follow-up. |
|
I think they're part of the spec decoding module https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash-DSpark, i.e. not distributed separately |
Sorry, and thanks for the heads-up. For this PR I'd like to keep the scope a bit narrower for now - land the Qwen3 DSpark path first and get it solid, then add Gemma4 and DSV4 as follow-ups. Does that sound ok? |
|
Okay, will try to review. From a cursory look it does not look like adding |
I agree that |
DSpark (DeepSpec, 2026) on top of the merged DFlash drafter. It reuses the DFlash encoder/decoder graph, target feature extraction and KV-cache injection, and the verify/accept path unchanged; the draft model is a new "dspark" arch adding a low-rank Markov head (markov_w1/w2) and an optional (unused here) confidence head. No new public APIs.

The proposal is the only change: the block is anchor-first (position 0 already predicts the first draft) and the decoder graph applies a semi-autoregressive, previous-token conditioned logit bias in-graph, chained per block position:

    logits'(i) = logits(i) + markov_w2 . markov_w1[prev(i)]
    prev(0)    = the block's anchor token
    prev(i>0)  = argmax(logits'(i-1))

vectorized across all blocks in the batch; the anchors are fed through a dedicated graph input (token 0 of every block). Greedy stays lossless (verify unchanged, same as DFlash).

 - new arch "dspark" (llama_model_dspark : llama_model_dflash, reuses the graph, loads the markov/confidence tensors; shares the target's embed/lm_head).
 - Qwen3DSparkModel converter.
 - new spec type "draft-dspark" (common_speculative_impl_draft_dspark : common_speculative_impl_draft_dflash, overrides draft() only: submits whole anchor-first blocks and greedily reads back the biased logits).
|
@am17an Thanks for the feedback, you were right about the API. I've reworked the PR: all llama_dspark_* public APIs are gone. The Markov head is now applied inside the dspark decoder graph (chained per block position, vectorized across blocks), so llama.h / llama-ext.h are untouched and common/speculative.cpp only overrides draft() on top of the DFlash impl, same pattern as before. I've reviewed the code, can explain every line of it myself, and have re-verified the results. I'd really appreciate another look when you have time. |
|
@ruixiang63 Could you take another look at this PR when you have time? Thanks! |
This PR adds DSpark speculative decoding, layered on the merged DFlash drafter. DSpark (DeepSeek + PKU, 2026 — "Confidence-Scheduled Speculative Decoding with Semi-Autoregressive Generation", the DeepSpec repo) is DFlash plus a small semi-autoregressive Markov head: where DFlash takes an independent argmax at each block position (every position marginalizes over all possible predecessors, so acceptance decays along the block), DSpark adds a low-rank, previous-token-conditioned logit bias and samples the block left-to-right, so each draft conditions on the one actually sampled before it. This lifts accepted length at near-zero extra draft cost.
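To make the difference concrete, here is a minimal pure-Python sketch of the two proposal rules for a single block. It is an illustration under assumptions, not the PR's implementation (the PR applies the bias inside the decoder graph, vectorized across blocks). The function names, the tensor layouts (markov_w1 as a [vocab][rank] lookup table, markov_w2 as a [vocab][rank] projection back to the vocabulary) and the toy numbers are all hypothetical.

```python
def argmax(xs):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(xs)), key=lambda j: xs[j])


def markov_bias(prev, markov_w1, markov_w2):
    """bias(prev) = markov_w2 . markov_w1[prev].

    Assumed layouts (hypothetical): markov_w1 is [vocab][rank],
    markov_w2 is [vocab][rank], so the result has one entry per vocab token.
    """
    h = markov_w1[prev]  # low-rank embedding of the previous token
    return [sum(w * x for w, x in zip(row, h)) for row in markov_w2]


def dflash_block(logits):
    """DFlash-style proposal: independent argmax at every block position."""
    return [argmax(row) for row in logits]


def dspark_block(logits, anchor, markov_w1, markov_w2):
    """DSpark-style proposal: bias each position by the token chosen before it."""
    draft, prev = [], anchor  # prev(0) is the block's anchor token
    for row in logits:
        bias = markov_bias(prev, markov_w1, markov_w2)
        biased = [l + b for l, b in zip(row, bias)]
        prev = argmax(biased)  # prev(i>0) = argmax(logits'(i-1))
        draft.append(prev)
    return draft


# Toy setup: vocab of 3 tokens, rank 1. After token 1, token 2 gets a +5 bias.
markov_w1 = [[0.0], [1.0], [0.0]]
markov_w2 = [[0.0], [0.0], [5.0]]
logits = [[0.0, 2.0, 1.0], [1.0, 0.0, 0.5]]

independent = dflash_block(logits)
chained = dspark_block(logits, anchor=0, markov_w1=markov_w1, markov_w2=markov_w2)
print(independent, chained)
```

In this toy case the second position changes once it is conditioned on the token actually chosen at the first position, which is the effect the Markov head is meant to capture.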
DSpark reuses the entire DFlash machinery unchanged: the encoder/decoder graph, target-layer feature extraction (llama_set_embeddings_layer_inp/_nextn), KV-cache injection, and the verify/accept path. The only additions are:

- a new draft arch dspark (llama_model_dspark : llama_model_dflash) that reuses the DFlash graph and additionally loads the Markov head (markov_w1, markov_w2) and an optional confidence head; it shares the target's token-embeddings / lm_head (same as DFlash);
- a new spec type draft-dspark (common_speculative_impl_draft_dspark : common_speculative_impl_draft_dflash) that reuses process() (extraction + injection) and overrides only draft(): the block is anchor-first (position 0 already predicts the first draft token) and sampled with the Markov bias bias(prev) = markov_w2 · markov_w1[prev], computed on-device (llama_dspark_markov_bias);
- a Qwen3DSparkModel converter.

Greedy decoding is lossless: the Markov bias only changes which tokens are proposed; every draft is still verified against the target, so the output is identical to non-speculative greedy.
The confidence head is converted/loaded but not used at inference in this PR (phase 1); the draft-quality win from the Markov head is self-contained and is what the numbers below measure.
How to run
Complete example from scratch (Qwen3-8B). Drafts for other sizes are on the same org: deepseek-ai/dspark_qwen3_{4b,8b,14b}_block7.

1. Get the models (target + its DSpark draft):

2. Convert to GGUF. The draft ships no tokenizer and reuses the target's, so pass --target-model-dir:

    python convert_hf_to_gguf.py Qwen3-8B --outtype bf16 --outfile Qwen3-8B.gguf
    python convert_hf_to_gguf.py dspark_qwen3_8b --outtype bf16 \
        --target-model-dir Qwen3-8B --outfile Qwen3-8B-DSpark.gguf

You may quantize the target (e.g. llama-quantize Qwen3-8B.gguf Qwen3-8B-Q4_K_M.gguf Q4_K_M); keep the draft bf16: it's tiny, and acceptance is unaffected by target quant.

3. Build with CUDA:

    cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release -j

4. Run. The only DSpark-specific flags are -md <draft> and --spec-type draft-dspark (--spec-draft-n-max = draft tokens per step; the released checkpoints use block size 7):

    ./build/bin/llama-server -m Qwen3-8B.gguf -md Qwen3-8B-DSpark.gguf \
        --spec-type draft-dspark --spec-draft-n-max 7 \
        --temp 0 --top-k 1 -np 1 -c 4096 -ngl 99 -fa on --jinja

5. Send a request (the server logs draft acceptance = ... per request):

llama-cli works the same way (-m ... -md ... --spec-type draft-dspark). Note: draft-dspark needs the target's hidden states (KV-cache injection), so use llama-server / llama-cli; the speculative-simple example does not drive that path.

Performance
SpeedBench (llama.cpp's own tools/server/bench/speed-bench)

Qwen3-8B (bf16), matched --spec-draft-n-max 7, qualitative split (11 categories), greedy. Baseline is the same server with no draft model. DSpark reaches 1.88× overall decode speedup vs baseline (DFlash is 1.55×), and beats the merged DFlash on every one of the 11 categories (1.21× overall).

DSpark vs baseline:

DSpark vs the merged DFlash (same --spec-draft-n-max 7):

Hardware: RTX 4090. Target Qwen/Qwen3-8B (bf16), draft deepseek-ai/dspark_qwen3_8b_block7 (bf16). Greedy (--temp 0 --top-k 1), no-thinking, --spec-draft-n-max 7. Baseline = same llama-server with no draft model. DFlash is the merged drafter (z-lab/Qwen3-8B-DFlash, bf16), run at the same matched draft size for an apples-to-apples comparison. Per-domain aggregate over the listed prompt counts.

Losslessness
Greedy decoding is lossless by construction (the draft is verified against the target). Output is coherent and matches non-speculative greedy.
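The reasoning behind "lossless by construction" can be sketched with a generic greedy verify step. This is a simplified illustration of standard speculative-decoding verification, not llama.cpp's actual verify/accept code; the function name and the toy token IDs are hypothetical. It assumes the target has scored the whole drafted block in one pass, so target_argmax[i] is the target's greedy token at the position of draft[i], with one extra entry for the position after the block.

```python
def verify_greedy(draft, target_argmax):
    """Return the tokens to emit for one speculative step under greedy decoding.

    draft:         tokens proposed by the drafter for this block
    target_argmax: the target's greedy token at each drafted position, plus one
                   extra entry for the position after the block
                   (len(target_argmax) == len(draft) + 1)
    """
    accepted = []
    for i, tok in enumerate(draft):
        if tok != target_argmax[i]:
            break  # first disagreement: everything after it is discarded
        accepted.append(tok)
    # Emit the target's own token at the first mismatch, or the extra token
    # after a fully accepted block. Entries past this index are never used,
    # since they were conditioned on rejected draft tokens.
    accepted.append(target_argmax[len(accepted)])
    return accepted


partial = verify_greedy([5, 7, 9], [5, 7, 8, 3])   # third draft token rejected
full = verify_greedy([5, 7, 8], [5, 7, 8, 3])      # whole block accepted
none = verify_greedy([1, 7, 8], [5, 7, 8, 3])      # first draft token rejected
print(partial, full, none)
```

Every emitted token is one the target itself would have picked greedily, so a better drafter (DSpark vs DFlash) changes only how many tokens are emitted per step, never which ones.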
Qwen3-4B, target bf16
DSpark vs baseline (DFlash was not benchmarked at 4B — no nested-schema 4B DFlash draft available):
Qwen3-8B, target bf16
Qwen3-8B, target Q8_0
Qwen3-8B, target Q4_K_M
DSpark beats the merged DFlash on every domain (higher accept rate and higher throughput), for a ~1.16× geomean speedup over DFlash. The gains are largest on reasoning (GSM8K +25pp accept, 1.30× over DFlash) and open chat (MT-Bench, 1.29×); on code (HumanEval) the two are close as both already accept ~80%.
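As a quick arithmetic check of how these ratios relate, the short sketch below recomputes the DSpark-over-DFlash factor from the speedups quoted in this PR (GSM8K bf16: 4.06× vs 3.12×; SpeedBench overall: 1.88× vs 1.55×). The helper names are hypothetical, and the geomean helper is shown only to state the aggregation; the per-domain values behind the ~1.16× figure come from the tables and are not reproduced here.

```python
import math


def relative_speedup(dspark_vs_baseline, dflash_vs_baseline):
    """DSpark-over-DFlash factor when both are measured against the same baseline."""
    return dspark_vs_baseline / dflash_vs_baseline


def geomean(ratios):
    """Geometric mean, the usual way to average multiplicative speedups."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


gsm8k = relative_speedup(4.06, 3.12)       # GSM8K, Qwen3-8B bf16
speedbench = relative_speedup(1.88, 1.55)  # SpeedBench overall, Qwen3-8B bf16
print(f"GSM8K: {gsm8k:.2f}x, SpeedBench overall: {speedbench:.2f}x")
```

Both results round to the figures quoted in the text (1.30× and 1.21×).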
Future work
Requirements