spec: add DSpark speculative decoding #25173

Open: wjinxu wants to merge 4 commits into ggml-org:master from wjinxu:dspark-upstream

wjinxu commented Jun 30, 2026

This PR adds DSpark speculative decoding, layered on the merged DFlash drafter. DSpark (DeepSeek + PKU, 2026; "Confidence-Scheduled Speculative Decoding with Semi-Autoregressive Generation", the DeepSpec repo) is DFlash plus a small semi-autoregressive Markov head. DFlash takes an independent argmax at each block position: every position marginalizes over all possible predecessors, so acceptance decays along the block. DSpark adds a low-rank, previous-token-conditioned logit bias and samples the block left-to-right, so each draft token conditions on the token actually sampled before it. This lifts accepted length at near-zero extra draft cost.

DSpark reuses the entire DFlash machinery unchanged — the encoder/decoder graph, target-layer feature extraction (llama_set_embeddings_layer_inp / _nextn), KV-cache injection, and the verify/accept path. The only additions are:

  • a new draft architecture dspark (llama_model_dspark : llama_model_dflash) that reuses the DFlash graph and additionally loads the Markov head (markov_w1, markov_w2) and an optional confidence head; it shares the target's token-embeddings / lm_head (same as DFlash);
  • a new speculative type draft-dspark (common_speculative_impl_draft_dspark : common_speculative_impl_draft_dflash) that reuses process() (extraction + injection) and overrides only draft(): the block is anchor-first (position 0 already predicts the first draft token) and sampled with the Markov bias bias(prev) = markov_w2 · markov_w1[prev], computed on-device (llama_dspark_markov_bias);
  • a Qwen3DSparkModel converter.
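
To make the draft() change concrete, here is a minimal sketch of anchor-first, Markov-biased greedy drafting in plain Python. It is not the PR's code: the toy shapes (both matrices stored as vocab x rank), the helper names `markov_bias` and `draft_block`, and the use of nested lists instead of ggml tensors are assumptions for illustration. Only the formula bias(prev) = markov_w2 · markov_w1[prev] and the anchor-first ordering come from the description above.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def markov_bias(markov_w1: Matrix, markov_w2: Matrix, prev: int) -> Vector:
    """bias(prev) = markov_w2 . markov_w1[prev], a low-rank logit bias over the vocab."""
    h = markov_w1[prev]  # rank-sized embedding of the previous token
    return [sum(w * x for w, x in zip(row, h)) for row in markov_w2]


def argmax(values: Vector) -> int:
    return max(range(len(values)), key=values.__getitem__)


def draft_block(block_logits: Matrix, markov_w1: Matrix, markov_w2: Matrix,
                anchor: int) -> List[int]:
    """Greedy left-to-right drafting: position 0 is conditioned on the anchor token,
    every later position on the token drafted just before it."""
    drafted: List[int] = []
    prev = anchor
    for logits in block_logits:
        bias = markov_bias(markov_w1, markov_w2, prev)
        prev = argmax([l + b for l, b in zip(logits, bias)])
        drafted.append(prev)
    return drafted


# Toy data (hypothetical): vocab of 3 tokens, rank 2, block of 3 positions.
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # vocab x rank
W2 = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]  # vocab x rank
LOGITS = [[1.0, 0.5, 0.0]] * 3             # identical unbiased logits per position

independent = [argmax(l) for l in LOGITS]         # per-position argmax, no bias
chained = draft_block(LOGITS, W1, W2, anchor=0)   # bias chained on the previous draft
print(independent, chained)  # [0, 0, 0] [1, 2, 1]
```

With identical unbiased logits at every position, the independent argmax proposes the same token three times, while the chained draft changes with each previously drafted token. That dependence on the previous token is the effect the Markov head is meant to capture.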

Greedy decoding is lossless: the Markov bias only changes which tokens are proposed; every draft is still verified against the target, so the output is identical to non-speculative greedy.
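
The losslessness argument can be illustrated with a toy greedy verify loop. This is a sketch under assumptions, not the DFlash verify path: `target_next` is a hypothetical stand-in for one greedy step of the target model, and the real implementation verifies a whole block in one batched pass rather than token by token.

```python
from typing import Callable, List, Optional, Sequence

NextFn = Callable[[Sequence[int]], int]
Drafter = Callable[[Sequence[int], int], List[int]]


def verify_greedy(context: List[int], draft: Sequence[int], target_next: NextFn) -> List[int]:
    """Accept the longest draft prefix matching the target's greedy choice.
    The emitted token is always the target's own, so the draft only affects speed."""
    out = list(context)
    for tok in draft:
        expected = target_next(out)
        out.append(expected)
        if tok != expected:       # first rejected draft token ends this step
            return out
    out.append(target_next(out))  # extra target token after a fully accepted block
    return out


def decode(prompt: List[int], n_tokens: int, target_next: NextFn,
           drafter: Optional[Drafter] = None, block: int = 3) -> List[int]:
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        draft = drafter(out, block) if drafter else []
        out = verify_greedy(out, draft, target_next)
    return out[:len(prompt) + n_tokens]


# Hypothetical toy target: next token = (sum of context) mod 5.
def target(ctx: Sequence[int]) -> int:
    return sum(ctx) % 5


def good_drafter(ctx: Sequence[int], n: int) -> List[int]:
    return [target(ctx)] + [0] * (n - 1)  # first guess right, the rest arbitrary


def bad_drafter(ctx: Sequence[int], n: int) -> List[int]:
    return [4] * n                        # mostly wrong guesses


baseline = decode([1, 2], 12, target)
assert decode([1, 2], 12, target, good_drafter) == baseline
assert decode([1, 2], 12, target, bad_drafter) == baseline
```

Both drafters produce the same output as the draft-free baseline. A better drafter only reduces the number of verify steps needed.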

The confidence head is converted/loaded but not used at inference in this PR (phase 1); the draft-quality win from the Markov head is self-contained and is what the numbers below measure.

How to run

Complete example from scratch (Qwen3-8B). Drafts for other sizes are on the same org: deepseek-ai/dspark_qwen3_{4b,8b,14b}_block7.

1. Get the models — target + its DSpark draft:

huggingface-cli download Qwen/Qwen3-8B --local-dir Qwen3-8B
huggingface-cli download deepseek-ai/dspark_qwen3_8b_block7 --local-dir dspark_qwen3_8b

2. Convert to GGUF — the draft ships no tokenizer and reuses the target's, so pass --target-model-dir:

python convert_hf_to_gguf.py Qwen3-8B --outtype bf16 --outfile Qwen3-8B.gguf
python convert_hf_to_gguf.py dspark_qwen3_8b --outtype bf16 \
    --target-model-dir Qwen3-8B --outfile Qwen3-8B-DSpark.gguf

You may quantize the target (e.g. llama-quantize Qwen3-8B.gguf Qwen3-8B-Q4_K_M.gguf Q4_K_M); keep the draft in bf16. It is tiny, and in the benchmarks below the DSpark accept rate changes by only a few points across bf16, Q8_0 and Q4_K_M targets.

3. Build with CUDA:

cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release -j

4. Run — the only DSpark-specific flags are -md <draft> and --spec-type draft-dspark
(--spec-draft-n-max = draft tokens per step; the released checkpoints use block size 7):

./build/bin/llama-server -m Qwen3-8B.gguf -md Qwen3-8B-DSpark.gguf \
    --spec-type draft-dspark --spec-draft-n-max 7 \
    --temp 0 --top-k 1 -np 1 -c 4096 -ngl 99 -fa on --jinja

5. Send a request (the server logs draft acceptance = ... per request):

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "messages": [{"role": "user", "content": "Explain the Pythagorean theorem."}],
  "temperature": 0, "max_tokens": 256
}'
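
For scripted testing, the same request can be built from Python with only the standard library. This is a sketch: the function names are made up for illustration, the default `base_url` assumes the server from step 4 on its default port, and the response parsing assumes an OpenAI-style body with `choices[0].message.content`.

```python
import json
import urllib.request


def build_chat_request(prompt: str, base_url: str = "http://localhost:8080",
                       max_tokens: int = 256) -> urllib.request.Request:
    """Build the same POST as the curl command above (greedy: temperature 0)."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(req: urllib.request.Request, timeout_s: float = 120.0) -> str:
    """Send the request and return the assistant text (assumed response layout)."""
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(send_chat_request(build_chat_request("Explain the Pythagorean theorem.")))` should print the completion; the draft acceptance figure appears in the server log, not in this response.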

llama-cli works the same way (-m ... -md ... --spec-type draft-dspark). Note: draft-dspark needs the target's hidden states (KV-cache injection), so use llama-server / llama-cli — the speculative-simple example does not drive that path.

Performance

SpeedBench (llama.cpp's own tools/server/bench/speed-bench)

Qwen3-8B (bf16), matched --spec-draft-n-max 7, qualitative split (11 categories), greedy. Baseline is the same server with no draft model. DSpark reaches 1.88× overall decode speedup vs baseline (DFlash is 1.55×), and beats the merged DFlash on every one of the 11 categories (overall 1.21×).

DSpark vs baseline:

category       base_avg_pred_t/s  spec_avg_pred_t/s  decode_speedup  base_avg_latency  spec_avg_latency  latency_speedup  accept_rate
-------------  -----------------  -----------------  --------------  ----------------  ----------------  ---------------  -----------
coding         58.16              123.57             2.12x           11.172s           5.458s            2.05x            0.3219
humanities     58.22              99.06              1.70x           9.573s            5.646s            1.70x            0.2340
math           58.21              109.86             1.89x           10.313s           5.409s            1.91x            0.2840
qa             58.23              107.32             1.84x           8.313s            4.486s            1.85x            0.2659
rag            57.91              123.54             2.13x           9.521s            4.639s            2.05x            0.3264
reasoning      58.21              99.29              1.71x           9.570s            5.622s            1.70x            0.2347
stem           58.19              98.92              1.70x           8.827s            5.205s            1.70x            0.2332
writing        57.82              111.32             1.93x           9.765s            5.282s            1.85x            0.2807
multilingual   58.18              121.96             2.10x           8.691s            4.250s            2.05x            0.3187
summarization  58.36              102.74             1.76x           5.309s            3.001s            1.77x            0.2530
roleplay       58.20              102.56             1.76x           14.139s           8.274s            1.71x            0.2454
overall        58.15              109.10             1.88x           9.563s            5.207s            1.84x            0.2698

DSpark vs the merged DFlash (same --spec-draft-n-max 7):

category       dflash_avg_pred_t/s  dspark_avg_pred_t/s  decode_speedup  dflash_avg_latency  dspark_avg_latency  latency_speedup  accept_rate
-------------  -------------------  -------------------  --------------  ------------------  ------------------  ---------------  -----------
coding         106.00               123.57               1.17x           6.343s              5.458s              1.16x            0.3219
humanities     83.61                99.06                1.18x           6.674s              5.646s              1.18x            0.2340
math           90.48                109.86               1.21x           6.529s              5.409s              1.21x            0.2840
qa             85.20                107.32               1.26x           5.650s              4.486s              1.26x            0.2659
rag            98.61                123.54               1.25x           5.733s              4.639s              1.24x            0.3264
reasoning      83.51                99.29                1.19x           6.681s              5.622s              1.19x            0.2347
stem           83.60                98.92                1.18x           6.154s              5.205s              1.18x            0.2332
writing        90.28                111.32               1.23x           6.443s              5.282s              1.22x            0.2807
multilingual   102.94               121.96               1.18x           5.016s              4.250s              1.18x            0.3187
summarization  85.85                102.74               1.20x           3.606s              3.001s              1.20x            0.2530
roleplay       79.96                102.56               1.28x           10.451s             8.274s              1.26x            0.2454
overall        90.00                109.10               1.21x           6.298s              5.207s              1.21x            0.2698

Hardware: RTX 4090. Target Qwen/Qwen3-8B (bf16), draft deepseek-ai/dspark_qwen3_8b_block7 (bf16). Greedy (--temp 0 --top-k 1), no-thinking, --spec-draft-n-max 7. Baseline = same llama-server with no draft model. DFlash is the merged drafter (z-lab/Qwen3-8B-DFlash, bf16), run at the same matched draft size for an apples-to-apples comparison. Per-domain aggregate over the listed prompt counts.
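
As a sanity check on the headline numbers, the "overall" rows can be recomputed by hand. The sketch below assumes decode speedup is the ratio of tokens per second and latency speedup is the ratio of average latencies. The published rows are consistent with that reading, but the exact aggregation inside speed-bench is not shown here.

```python
def decode_speedup(base_tps: float, spec_tps: float) -> float:
    """Throughput ratio: how many times more tokens per second."""
    return spec_tps / base_tps


def latency_speedup(base_latency_s: float, spec_latency_s: float) -> float:
    """Latency ratio: how many times shorter the average request is."""
    return base_latency_s / spec_latency_s


# "overall" rows copied from the two SpeedBench tables above.
BASE_TPS, DFLASH_TPS, DSPARK_TPS = 58.15, 90.00, 109.10
BASE_LAT, DFLASH_LAT, DSPARK_LAT = 9.563, 6.298, 5.207

dspark_vs_base = decode_speedup(BASE_TPS, DSPARK_TPS)
dflash_vs_base = decode_speedup(BASE_TPS, DFLASH_TPS)
dspark_vs_dflash = decode_speedup(DFLASH_TPS, DSPARK_TPS)

print(f"DSpark vs baseline: {dspark_vs_base:.2f}x decode, "
      f"{latency_speedup(BASE_LAT, DSPARK_LAT):.2f}x latency")
print(f"DFlash vs baseline: {dflash_vs_base:.2f}x decode")
print(f"DSpark vs DFlash:   {dspark_vs_dflash:.2f}x decode, "
      f"{latency_speedup(DFLASH_LAT, DSPARK_LAT):.2f}x latency")
```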

Losslessness

Greedy decoding is lossless by construction (the draft is verified against the target). Output is coherent and matches non-speculative greedy.

Qwen3-4B, target bf16

DSpark vs baseline (DFlash was not benchmarked at 4B — no nested-schema 4B DFlash draft available):

domain (prompts)  baseline_t/s  dspark_t/s  dspark_accept  dspark_speedup
----------------  ------------  ----------  -------------  --------------
GSM8K (40)        103.1         354.0       75.3%          3.43×
MATH500 (30)      102.9         341.3       71.7%          3.32×
HumanEval (30)    103.9         340.0       72.9%          3.27×
MBPP (30)         103.6         281.4       57.2%          2.72×
MT-Bench (30)     102.8         190.4       31.7%          1.85×
geomean                                                    2.85×

Qwen3-8B, target bf16

domain (prompts)  baseline_t/s  dflash_t/s  dflash_accept  dspark_t/s  dspark_accept  dflash_speedup  dspark_speedup
----------------  ------------  ----------  -------------  ----------  -------------  --------------  --------------
GSM8K (40)        58.5          182.4       53.7%          237.3       78.9%          3.12×           4.06×
MATH500 (30)      58.5          195.7       59.2%          223.2       72.8%          3.35×           3.82×
HumanEval (30)    59.1          238.8       77.2%          241.4       81.7%          4.04×           4.08×
MBPP (30)         59.6          177.3       53.3%          193.1       63.7%          2.98×           3.24×
MT-Bench (30)     58.6          93.5        19.7%          120.4       31.3%          1.60×           2.05×
geomean                                                                               2.89×           3.35×

Qwen3-8B, target Q8_0

domain (prompts)  baseline_t/s  dflash_t/s  dflash_accept  dspark_t/s  dspark_accept  dflash_speedup  dspark_speedup
----------------  ------------  ----------  -------------  ----------  -------------  --------------  --------------
GSM8K (40)        100.6         246.4       53.2%          322.9       77.8%          2.45×           3.21×
MATH500 (30)      100.5         266.2       59.2%          305.7       72.2%          2.65×           3.04×
HumanEval (30)    101.3         319.5       76.5%          327.9       81.4%          3.15×           3.24×
MBPP (30)         102.2         242.4       54.3%          268.0       64.3%          2.37×           2.62×
MT-Bench (30)     100.7         126.7       19.3%          167.8       31.4%          1.26×           1.67×
geomean                                                                               2.28×           2.68×

Qwen3-8B, target Q4_K_M

domain (prompts)  baseline_t/s  dflash_t/s  dflash_accept  dspark_t/s  dspark_accept  dflash_speedup  dspark_speedup
----------------  ------------  ----------  -------------  ----------  -------------  --------------  --------------
GSM8K (40)        155.4         259.0       52.9%          340.7       77.4%          1.67×           2.19×
MATH500 (30)      155.2         284.9       60.4%          326.1       73.9%          1.84×           2.10×
HumanEval (30)    156.5         314.3       71.1%          332.0       78.4%          2.01×           2.12×
MBPP (30)         157.5         257.5       55.5%          281.3       66.0%          1.63×           1.79×
MT-Bench (30)     155.5         135.2       19.7%          174.4       30.6%          0.87×           1.12×
geomean                                                                               1.54×           1.81×

DSpark beats the merged DFlash on every domain and at every target quantization (higher accept rate and higher throughput), for a geomean speedup over DFlash of ~1.16× at bf16 (3.35× vs 2.89×) and ~1.18× at Q8_0 and Q4_K_M. The gains are largest on reasoning (GSM8K at bf16: +25pp accept, 1.30× over DFlash) and open chat (MT-Bench, 1.29×); on code (HumanEval) the two are close because both already accept roughly 80%.
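
The geomean rows can be reproduced from the per-domain speedups. The sketch below recomputes the Qwen3-8B bf16 row; the `geomean` helper is plain arithmetic written for this check and is not part of the benchmark tooling.

```python
import math
from typing import Sequence


def geomean(values: Sequence[float]) -> float:
    """Geometric mean, the usual way to average ratios such as speedups."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Per-domain speedups vs baseline, Qwen3-8B bf16 target
# (GSM8K, MATH500, HumanEval, MBPP, MT-Bench), copied from the table.
DFLASH = [3.12, 3.35, 4.04, 2.98, 1.60]
DSPARK = [4.06, 3.82, 4.08, 3.24, 2.05]

g_dflash, g_dspark = geomean(DFLASH), geomean(DSPARK)
print(f"DFlash {g_dflash:.2f}x, DSpark {g_dspark:.2f}x, "
      f"DSpark over DFlash {g_dspark / g_dflash:.2f}x")
# prints: DFlash 2.89x, DSpark 3.35x, DSpark over DFlash 1.16x
```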

Future work

  • Confidence head (phase 2): wire the confidence-scheduled prefix pruning, with the paper's Sequential Temperature Scaling calibration. The big serving win in the paper comes from the batched scheduler, which is a separate, larger change.
  • Markov-bias graph reuse: the bias is computed as a tiny per-step ggml graph on the draft context's scheduler; building it once per block and re-running with new inputs would cut overhead. A fused bias+argmax kernel is a further option but would add a backend-specific op (the current path is pure ggml, no new operator).

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Yes, I used Claude to help discuss and design the DSpark architecture, ask clarifying questions, and assist with writing tests. Everything remains under my control, and I reviewed every single line of AI-generated code.

@github-actions github-actions Bot added the model (Model specific) and conversion labels Jun 30, 2026
@wjinxu wjinxu force-pushed the dspark-upstream branch from f3b83cd to d74ff77 Compare June 30, 2026 14:16
@github-actions github-actions Bot added the testing (Everything test related) label Jun 30, 2026
@wjinxu wjinxu force-pushed the dspark-upstream branch from d74ff77 to 37f2513 Compare June 30, 2026 14:39
@wjinxu wjinxu marked this pull request as ready for review June 30, 2026 17:00
@wjinxu wjinxu requested review from a team, CISC, JohannesGaessler and ggerganov as code owners June 30, 2026 17:00
wjinxu (Author) commented Jun 30, 2026

Hi @CISC @ggerganov, this adds DSpark speculative decoding on top of the merged DFlash drafter (#22105). It's a small change: a new dspark draft arch and draft-dspark spec type that reuse DFlash's graph, feature extraction, KV-cache injection and verify path unchanged; the only new logic is the semi-autoregressive Markov head in draft(). Greedy stays lossless.

I benchmarked it against the merged DFlash using DeepSeek's released Qwen3 DSpark drafts. On Qwen3-8B at bf16 / Q8_0 / Q4_K_M, DSpark beats DFlash on every domain (e.g. GSM8K bf16 4.06× vs 3.12×; full per-domain tables in the PR description).

I believe it's ready for review and I'm happy to walk through any part of it.

ruixiang63 (Member) commented Jun 30, 2026

Can you run SpeedBench to do the full comparison between DFlash and DSpark with the same --spec-draft-n-max? https://github.com/ggml-org/llama.cpp/tree/master/tools/server/bench/speed-bench

@CISC CISC requested a review from ruixiang63 June 30, 2026 17:47
3 comment threads on conversion/qwen.py (Outdated)
wjinxu (Author) commented Jun 30, 2026

@ruixiang63 I've run the SpeedBench test set as you suggested, and updated the results in the PR description. DSpark does outperform DFlash across the board.

nipeone commented Jul 1, 2026

could you give some examples how to use?

wjinxu (Author) commented Jul 1, 2026

> could you give some examples how to use?

Good point — I've updated the PR description with a more detailed, copy-pasteable end-to-end example (download → convert → build → run → curl). Let me know if anything's unclear.

Comment thread src/llama-model.cpp Outdated
@wjinxu wjinxu force-pushed the dspark-upstream branch from d8b38f2 to 47f3442 Compare July 1, 2026 05:17
am17an (Contributor) commented Jul 1, 2026

DSV4 support was merged in #24162, ideally this PR should cover that model as well and try to replicate a similar speedup

wjinxu (Author) commented Jul 1, 2026

> DSV4 support was merged in #24162, ideally this PR should cover that model as well and try to replicate a similar speedup

Thanks! DeepSeek hasn't open-sourced the DSpark weights for DeepSeek-V4 though — only the Qwen3 and Gemma4 drafts are released. So this PR covers Qwen3 for now, and I'll add Gemma4 as a small follow-up.

am17an (Contributor) commented Jul 1, 2026

I think they're a part of the spec decoding module https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash-DSpark i.e not distributed separately

wjinxu (Author) commented Jul 1, 2026

> I think they're a part of the spec decoding module https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash-DSpark i.e not distributed separately

Sorry, and thanks for the heads-up. For this PR I'd like to keep the scope a bit narrower for now - land the Qwen3 DSpark path first and get it solid, then add Gemma4 and DSV4 as follow-ups. Does that sound ok?

am17an (Contributor) commented Jul 1, 2026

Okay, will try to review. From a cursory look it does not look like adding llama_dspark* is the right thing to do. Honestly don't have a good feeling about the PR, too much AI code.

wjinxu (Author) commented Jul 1, 2026

> Okay, will try to review. From a cursory look it does not look like adding llama_dspark* is the right thing to do. Honestly don't have a good feeling about the PR, too much AI code.

I agree that llama_dspark_* shouldn't be part of the API. The issue is that the Markov bias computation (W2 @ W1[prev]) needs the draft weights and has to run on the backend - I measured it, and doing it on the host is too slow. But the DSpark drafter lives in common/, which can't reach the draft weights, so once the API is removed there's no way to trigger that computation from common/.

@wjinxu wjinxu marked this pull request as draft July 2, 2026 02:24
DSpark (DeepSpec, 2026) on top of the merged DFlash drafter. It reuses the
DFlash encoder/decoder graph, target feature extraction and KV-cache injection,
and the verify/accept path unchanged; the draft model is a new "dspark" arch
adding a low-rank Markov head (markov_w1/w2) and an optional (unused here)
confidence head. No new public APIs.

The proposal is the only change: the block is anchor-first (position 0 already
predicts the first draft) and the decoder graph applies a semi-autoregressive,
previous-token conditioned logit bias in-graph, chained per block position:

  logits'(i) = logits(i) + markov_w2 . markov_w1[prev(i)]
  prev(0)    = the block's anchor token, prev(i>0) = argmax(logits'(i-1))

vectorized across all blocks in the batch; the anchors are fed through a
dedicated graph input (token 0 of every block). Greedy stays lossless
(verify unchanged, same as DFlash).

- new arch "dspark" (llama_model_dspark : llama_model_dflash, reuses the graph,
  loads the markov/confidence tensors; shares the target's embed/lm_head).
- Qwen3DSparkModel converter.
- new spec type "draft-dspark" (common_speculative_impl_draft_dspark :
  common_speculative_impl_draft_dflash, overrides draft() only: submits whole
  anchor-first blocks and greedily reads back the biased logits).
@wjinxu wjinxu force-pushed the dspark-upstream branch from 47f3442 to 96c5be9 Compare July 2, 2026 07:24
wjinxu (Author) commented Jul 2, 2026

@am17an Thanks for the feedback — you were right about the API. I've reworked the PR:

All llama_dspark_* public APIs are gone. The Markov head is now applied inside the dspark decoder graph (chained per block position, vectorized across blocks), so llama.h / llama-ext.h are untouched and common/speculative.cpp only overrides draft() on top of the DFlash impl, same pattern as before.

I've reviewed and can explain every line of the code myself, and re-verified.

I'd really appreciate another look when you have time.

@wjinxu wjinxu marked this pull request as ready for review July 2, 2026 07:34
Comment thread common/speculative.cpp
@github-actions github-actions Bot added the documentation (Improvements or additions to documentation) label Jul 4, 2026
wjinxu (Author) commented Jul 4, 2026

@ruixiang63 Could you take another look at this PR when you have time? Thanks!

Labels: conversion, documentation, model, testing