spec: add DSpark speculative decoding by wjinxu · Pull Request #25173 · ggml-org/llama.cpp

wjinxu · 2026-06-30T13:55:32Z

This PR adds DSpark speculative decoding, layered on the merged DFlash drafter. DSpark (DeepSeek + PKU, 2026; "Confidence-Scheduled Speculative Decoding with Semi-Autoregressive Generation", the DeepSpec repo) is DFlash plus a small semi-autoregressive Markov head. DFlash takes an independent argmax at each block position: every position marginalizes over all possible predecessors, so acceptance decays along the block. DSpark adds a low-rank, previous-token-conditioned logit bias and samples the block left-to-right, so each draft token conditions on the one actually sampled before it. This lifts accepted length at near-zero extra draft cost.

DSpark reuses the entire DFlash machinery unchanged — the encoder/decoder graph, target-layer feature extraction (llama_set_embeddings_layer_inp / _nextn), KV-cache injection, and the verify/accept path. The only additions are:

- a new draft architecture dspark (llama_model_dspark : llama_model_dflash) that reuses the DFlash graph and additionally loads the Markov head (markov_w1, markov_w2) and an optional confidence head; it shares the target's token-embeddings / lm_head (same as DFlash);
- a new speculative type draft-dspark (common_speculative_impl_draft_dspark : common_speculative_impl_draft_dflash) that reuses process() (extraction + injection) and overrides only draft(): the block is anchor-first (position 0 already predicts the first draft token) and sampled with the Markov bias bias(prev) = markov_w2 · markov_w1[prev], computed on-device (llama_dspark_markov_bias);
- a Qwen3DSparkModel converter.
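
To make the Markov-head step concrete, here is a minimal pure-Python sketch of the anchor-first, left-to-right proposal described in the list above. It is an illustration only, not the PR's ggml code: the tensor layouts (markov_w1 as a vocab × rank table, markov_w2 as a vocab × rank matrix), the helper names and the toy numbers are all assumptions.

```python
def argmax(xs):
    """Index of the largest element (first one on ties)."""
    return max(range(len(xs)), key=xs.__getitem__)

def markov_bias(w1, w2, prev):
    """bias(prev) = markov_w2 . markov_w1[prev]; assumed layouts: w1[vocab][rank], w2[vocab][rank]."""
    h = w1[prev]  # rank-sized embedding of the previous token
    return [sum(wk * hk for wk, hk in zip(row, h)) for row in w2]

def draft_block(block_logits, w1, w2, anchor):
    """Greedy semi-autoregressive proposal: each position is biased by the token chosen before it."""
    prev, out = anchor, []
    for logits in block_logits:  # position 0 already predicts the first draft token
        bias = markov_bias(w1, w2, prev)
        prev = argmax([l + b for l, b in zip(logits, bias)])
        out.append(prev)
    return out

# Toy example: vocabulary of 3 tokens, rank 1 (hypothetical numbers).
w1 = [[0.0], [1.0], [0.0]]        # only token 1 yields a non-zero bias
w2 = [[0.0], [0.0], [2.0]]        # and that bias favours token 2
block_logits = [[0.0, 1.0, 0.5],  # position 0
                [1.0, 0.0, 0.5]]  # position 1

independent = [argmax(l) for l in block_logits]         # DFlash-style: no conditioning
chained = draft_block(block_logits, w1, w2, anchor=0)   # DSpark-style: conditioned on prev
print(independent, chained)
```

In the toy data the second position changes from token 0 to token 2 once it is conditioned on the token actually chosen at position 0, which is the effect the Markov head is meant to capture.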

Greedy decoding is lossless: the Markov bias only changes which tokens are proposed; every draft is still verified against the target, so the output is identical to non-speculative greedy.
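
The losslessness argument can be sketched in a few lines of Python. This is a toy model of greedy verify/accept, not the llama.cpp implementation: `target_next`, `good_draft` and `bad_draft` are hypothetical stand-ins, and the sketch queries the target one token at a time, whereas a real verify pass scores the whole block in one batch.

```python
def greedy_reference(target_next, prompt, n_new):
    """Plain greedy decoding with the target only."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target_next(seq))
    return seq[len(prompt):]

def speculative_greedy(target_next, draft_block, prompt, n_new):
    """Greedy speculative decoding: draft tokens survive only while they match the target."""
    seq = list(prompt)
    while len(seq) - len(prompt) < n_new:
        for tok in draft_block(seq):
            expected = target_next(seq)
            seq.append(expected)          # the emitted token is always the target's choice
            if tok != expected:           # first mismatch: drop the rest of the block
                break
        else:
            seq.append(target_next(seq))  # whole block accepted: one extra target token
    return seq[len(prompt):len(prompt) + n_new]

def target_next(seq):
    """Toy deterministic target: the next token depends on the last one."""
    return (seq[-1] * 3 + 1) % 7

def good_draft(seq):
    """Drafter whose first two tokens agree with the target."""
    t1 = target_next(seq)
    return [t1, target_next(seq + [t1]), 0]

def bad_draft(seq):
    """Drafter that is almost always wrong."""
    return [0, 0, 0]

ref = greedy_reference(target_next, [1], 10)
print(speculative_greedy(target_next, good_draft, [1], 10) == ref)
print(speculative_greedy(target_next, bad_draft, [1], 10) == ref)
```

Draft quality only changes how many target calls are saved per step; the emitted sequence is the target's greedy sequence in both cases.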

The confidence head is converted/loaded but not used at inference in this PR (phase 1); the draft-quality win from the Markov head is self-contained and is what the numbers below measure.

How to run

Complete example from scratch (Qwen3-8B). Drafts for other sizes are on the same org: deepseek-ai/dspark_qwen3_{4b,8b,14b}_block7.

1. Get the models — target + its DSpark draft:

huggingface-cli download Qwen/Qwen3-8B --local-dir Qwen3-8B
huggingface-cli download deepseek-ai/dspark_qwen3_8b_block7 --local-dir dspark_qwen3_8b

2. Convert to GGUF — the draft ships no tokenizer and reuses the target's, so pass --target-model-dir:

python convert_hf_to_gguf.py Qwen3-8B --outtype bf16 --outfile Qwen3-8B.gguf
python convert_hf_to_gguf.py dspark_qwen3_8b --outtype bf16 \
    --target-model-dir Qwen3-8B --outfile Qwen3-8B-DSpark.gguf

You may quantize the target (e.g. llama-quantize Qwen3-8B.gguf Qwen3-8B-Q4_K_M.gguf Q4_K_M). Keep the draft at bf16: it is tiny, and acceptance is nearly unchanged across target quantizations (GSM8K: 78.9% at bf16, 77.8% at Q8_0, 77.4% at Q4_K_M in the tables below).

3. Build with CUDA:

cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release -j

4. Run — the only DSpark-specific flags are -md <draft> and --spec-type draft-dspark
(--spec-draft-n-max = draft tokens per step; the released checkpoints use block size 7):

./build/bin/llama-server -m Qwen3-8B.gguf -md Qwen3-8B-DSpark.gguf \
    --spec-type draft-dspark --spec-draft-n-max 7 \
    --temp 0 --top-k 1 -np 1 -c 4096 -ngl 99 -fa on --jinja

5. Send a request (the server logs draft acceptance = ... per request):

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "messages": [{"role": "user", "content": "Explain the Pythagorean theorem."}],
  "temperature": 0, "max_tokens": 256
}'
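
For scripted use, the same request can be sent from Python with only the standard library. This is a sketch under assumptions: it assumes the server from step 4 is listening on localhost:8080 and that the response follows the OpenAI-style chat-completions shape (`choices[0].message.content`); `build_request` and `ask` are hypothetical helper names.

```python
import json
import urllib.request

def build_request(prompt, url="http://localhost:8080/v1/chat/completions", max_tokens=256):
    """Build the same POST request as the curl example, with greedy sampling."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(prompt):
    """Send the request and return the assistant text (assumes an OpenAI-style response body)."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Usage, with llama-server running:
#   print(ask("Explain the Pythagorean theorem."))
```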

llama-cli works the same way (-m ... -md ... --spec-type draft-dspark). Note: draft-dspark needs the target's hidden states (KV-cache injection), so use llama-server / llama-cli — the speculative-simple example does not drive that path.

Performance

SpeedBench (llama.cpp's own `tools/server/bench/speed-bench`)

Qwen3-8B (bf16), matched --spec-draft-n-max 7, qualitative split (11 categories), greedy. Baseline is the same server with no draft model. DSpark reaches 1.88× overall decode speedup vs baseline (DFlash is 1.55×), and beats the merged DFlash on every one of the 11 categories (overall 1.21×).

DSpark vs baseline:

category       base_avg_pred_t/s  spec_avg_pred_t/s  decode_speedup  base_avg_latency  spec_avg_latency  latency_speedup  accept_rate
-------------  -----------------  -----------------  --------------  ----------------  ----------------  ---------------  -----------
coding         58.16              123.57             2.12x           11.172s           5.458s            2.05x            0.3219
humanities     58.22              99.06              1.70x           9.573s            5.646s            1.70x            0.2340
math           58.21              109.86             1.89x           10.313s           5.409s            1.91x            0.2840
qa             58.23              107.32             1.84x           8.313s            4.486s            1.85x            0.2659
rag            57.91              123.54             2.13x           9.521s            4.639s            2.05x            0.3264
reasoning      58.21              99.29              1.71x           9.570s            5.622s            1.70x            0.2347
stem           58.19              98.92              1.70x           8.827s            5.205s            1.70x            0.2332
writing        57.82              111.32             1.93x           9.765s            5.282s            1.85x            0.2807
multilingual   58.18              121.96             2.10x           8.691s            4.250s            2.05x            0.3187
summarization  58.36              102.74             1.76x           5.309s            3.001s            1.77x            0.2530
roleplay       58.20              102.56             1.76x           14.139s           8.274s            1.71x            0.2454
overall        58.15              109.10             1.88x           9.563s            5.207s            1.84x            0.2698

DSpark vs the merged DFlash (same --spec-draft-n-max 7):

category       dflash_avg_pred_t/s  dspark_avg_pred_t/s  decode_speedup  dflash_avg_latency  dspark_avg_latency  latency_speedup  accept_rate
-------------  -------------------  -------------------  --------------  ------------------  ------------------  ---------------  -----------
coding         106.00               123.57               1.17x           6.343s              5.458s              1.16x            0.3219
humanities     83.61                99.06                1.18x           6.674s              5.646s              1.18x            0.2340
math           90.48                109.86               1.21x           6.529s              5.409s              1.21x            0.2840
qa             85.20                107.32               1.26x           5.650s              4.486s              1.26x            0.2659
rag            98.61                123.54               1.25x           5.733s              4.639s              1.24x            0.3264
reasoning      83.51                99.29                1.19x           6.681s              5.622s              1.19x            0.2347
stem           83.60                98.92                1.18x           6.154s              5.205s              1.18x            0.2332
writing        90.28                111.32               1.23x           6.443s              5.282s              1.22x            0.2807
multilingual   102.94               121.96               1.18x           5.016s              4.250s              1.18x            0.3187
summarization  85.85                102.74               1.20x           3.606s              3.001s              1.20x            0.2530
roleplay       79.96                102.56               1.28x           10.451s             8.274s              1.26x            0.2454
overall        90.00                109.10               1.21x           6.298s              5.207s              1.21x            0.2698
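
The headline ratios can be re-derived from the "overall" rows of the two tables. The short check below only restates numbers already listed there; the variable names are illustrative.

```python
# "overall" decode throughput (tokens/s) copied from the two SpeedBench tables
baseline_tps, dflash_tps, dspark_tps = 58.15, 90.00, 109.10

def speedup(new_tps, old_tps):
    """Decode speedup is the ratio of predicted-token throughputs."""
    return new_tps / old_tps

print(f"DSpark vs baseline: {speedup(dspark_tps, baseline_tps):.2f}x")  # 1.88x
print(f"DFlash vs baseline: {speedup(dflash_tps, baseline_tps):.2f}x")  # 1.55x
print(f"DSpark vs DFlash:   {speedup(dspark_tps, dflash_tps):.2f}x")    # 1.21x
```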

Hardware: RTX 4090. Target Qwen/Qwen3-8B (bf16), draft deepseek-ai/dspark_qwen3_8b_block7 (bf16). Greedy (--temp 0 --top-k 1), no-thinking, --spec-draft-n-max 7. Baseline = same llama-server with no draft model. DFlash is the merged drafter (z-lab/Qwen3-8B-DFlash, bf16), run at the same matched draft size for an apples-to-apples comparison. Per-domain aggregate over the listed prompt counts.

Losslessness

Greedy decoding is lossless by construction (the draft is verified against the target). Output is coherent and matches non-speculative greedy.

Qwen3-4B, target bf16

DSpark vs baseline (DFlash was not benchmarked at 4B — no nested-schema 4B DFlash draft available):

Domain          Baseline t/s  DSpark t/s (accept)  DSpark speedup
--------------  ------------  -------------------  --------------
GSM8K (40)      103.1         354.0 (75.3%)        3.43×
MATH500 (30)    102.9         341.3 (71.7%)        3.32×
HumanEval (30)  103.9         340.0 (72.9%)        3.27×
MBPP (30)       103.6         281.4 (57.2%)        2.72×
MT-Bench (30)   102.8         190.4 (31.7%)        1.85×
geomean                                            2.85×

Qwen3-8B, target bf16

Domain          Baseline t/s  DFlash t/s (accept)  DSpark t/s (accept)  DFlash speedup  DSpark speedup
--------------  ------------  -------------------  -------------------  --------------  --------------
GSM8K (40)      58.5          182.4 (53.7%)        237.3 (78.9%)        3.12×           4.06×
MATH500 (30)    58.5          195.7 (59.2%)        223.2 (72.8%)        3.35×           3.82×
HumanEval (30)  59.1          238.8 (77.2%)        241.4 (81.7%)        4.04×           4.08×
MBPP (30)       59.6          177.3 (53.3%)        193.1 (63.7%)        2.98×           3.24×
MT-Bench (30)   58.6          93.5 (19.7%)         120.4 (31.3%)        1.60×           2.05×
geomean                                                                 2.89×           3.35×

Qwen3-8B, target Q8_0

Domain          Baseline t/s  DFlash t/s (accept)  DSpark t/s (accept)  DFlash speedup  DSpark speedup
--------------  ------------  -------------------  -------------------  --------------  --------------
GSM8K (40)      100.6         246.4 (53.2%)        322.9 (77.8%)        2.45×           3.21×
MATH500 (30)    100.5         266.2 (59.2%)        305.7 (72.2%)        2.65×           3.04×
HumanEval (30)  101.3         319.5 (76.5%)        327.9 (81.4%)        3.15×           3.24×
MBPP (30)       102.2         242.4 (54.3%)        268.0 (64.3%)        2.37×           2.62×
MT-Bench (30)   100.7         126.7 (19.3%)        167.8 (31.4%)        1.26×           1.67×
geomean                                                                 2.28×           2.68×

Qwen3-8B, target Q4_K_M

Domain          Baseline t/s  DFlash t/s (accept)  DSpark t/s (accept)  DFlash speedup  DSpark speedup
--------------  ------------  -------------------  -------------------  --------------  --------------
GSM8K (40)      155.4         259.0 (52.9%)        340.7 (77.4%)        1.67×           2.19×
MATH500 (30)    155.2         284.9 (60.4%)        326.1 (73.9%)        1.84×           2.10×
HumanEval (30)  156.5         314.3 (71.1%)        332.0 (78.4%)        2.01×           2.12×
MBPP (30)       157.5         257.5 (55.5%)        281.3 (66.0%)        1.63×           1.79×
MT-Bench (30)   155.5         135.2 (19.7%)        174.4 (30.6%)        0.87×           1.12×
geomean                                                                 1.54×           1.81×

DSpark beats the merged DFlash on every domain (higher accept rate and higher throughput), for a ~1.16× geomean speedup over DFlash. The gains are largest on reasoning (GSM8K +25pp accept, 1.30× over DFlash) and open chat (MT-Bench, 1.29×); on code (HumanEval) the two are close as both already accept ~80%.
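
As a worked check of the geomean rows, the snippet below recomputes them for the Qwen3-8B bf16 table from the per-domain speedups listed there. Since the inputs are already rounded to two decimals, the results are only expected to agree to about two decimals.

```python
import math

def geomean(xs):
    """Geometric mean: exp of the mean of the logs."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

# Per-domain speedups vs baseline, Qwen3-8B bf16 (GSM8K, MATH500, HumanEval, MBPP, MT-Bench)
dflash = [3.12, 3.35, 4.04, 2.98, 1.60]
dspark = [4.06, 3.82, 4.08, 3.24, 2.05]

g_dflash, g_dspark = geomean(dflash), geomean(dspark)
print(f"DFlash geomean: {g_dflash:.2f}x")             # 2.89x
print(f"DSpark geomean: {g_dspark:.2f}x")             # 3.35x
print(f"DSpark over DFlash: {g_dspark / g_dflash:.2f}x")  # 1.16x
```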

Future work

- Confidence head (phase 2): wire the confidence-scheduled prefix pruning, with the paper's Sequential Temperature Scaling calibration. The big serving win in the paper comes from the batched scheduler, which is a separate, larger change.
- Markov-bias graph reuse: the bias is computed as a tiny per-step ggml graph on the draft context's scheduler; building it once per block and re-running with new inputs would cut overhead. A fused bias+argmax kernel is a further option but would add a backend-specific op (the current path is pure ggml, no new operator).

Requirements

- I have read and agree with the contributing guidelines
- AI usage disclosure: Yes, I used Claude to help discuss and design the DSpark architecture, ask clarifying questions, and assist with writing tests. Everything remains under my control, and I reviewed every single line of AI-generated code.

wjinxu · 2026-06-30T17:02:38Z

Hi @CISC @ggerganov , this adds DSpark speculative decoding on top of the merged DFlash drafter (#22105). It's a small change — a new dspark draft arch and draft-dspark spec type that reuse DFlash's graph, feature extraction, KV-cache injection and verify path unchanged; the only new logic is the semi-autoregressive Markov head in draft(). Greedy stays lossless.

I benchmarked it against the merged DFlash using DeepSeek's released Qwen3 DSpark drafts. On Qwen3-8B at bf16 / Q8_0 / Q4_K_M, DSpark beats DFlash on every domain (e.g. GSM8K bf16 4.06× vs 3.12×; full per-domain tables in the PR description).

I believe it's ready for review and I'm happy to walk through any part of it.

ruixiang63 · 2026-06-30T17:07:15Z

Can you run SpeedBench to do the full comparison between DFlash and DSpark with the same --spec-draft-n-max? https://github.com/ggml-org/llama.cpp/tree/master/tools/server/bench/speed-bench

wjinxu · 2026-06-30T18:05:45Z

@ruixiang63 I've run the SpeedBench test set as you suggested, and updated the results in the PR description. DSpark does outperform DFlash across the board.

nipeone · 2026-07-01T04:40:24Z

Could you give some examples of how to use it?

wjinxu · 2026-07-01T05:06:47Z

> Could you give some examples of how to use it?

Good point — I've updated the PR description with a more detailed, copy-pasteable end-to-end example (download → convert → build → run → curl). Let me know if anything's unclear.

am17an · 2026-07-01T06:28:10Z

DSV4 support was merged in #24162, ideally this PR should cover that model as well and try to replicate a similar speedup

wjinxu · 2026-07-01T06:37:36Z

> DSV4 support was merged in #24162, ideally this PR should cover that model as well and try to replicate a similar speedup

Thanks! DeepSeek hasn't open-sourced the DSpark weights for DeepSeek-V4 though — only the Qwen3 and Gemma4 drafts are released. So this PR covers Qwen3 for now, and I'll add Gemma4 as a small follow-up.

am17an · 2026-07-01T06:59:38Z

I think they're a part of the spec decoding module https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash-DSpark i.e. not distributed separately

wjinxu · 2026-07-01T07:16:50Z

> I think they're a part of the spec decoding module https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash-DSpark i.e. not distributed separately

Sorry, and thanks for the heads-up. For this PR I'd like to keep the scope a bit narrower for now - land the Qwen3 DSpark path first and get it solid, then add Gemma4 and DSV4 as follow-ups. Does that sound ok?

am17an · 2026-07-01T07:28:44Z

Okay, will try to review. From a cursory look it does not look like adding llama_dspark* is the right thing to do. Honestly don't have a good feeling about the PR, too much AI code.

wjinxu · 2026-07-01T08:27:15Z

> Okay, will try to review. From a cursory look it does not look like adding llama_dspark* is the right thing to do. Honestly don't have a good feeling about the PR, too much AI code.

I agree that llama_dspark_* shouldn't be part of the API. The issue is that the Markov bias computation (W2 @ W1[prev]) needs the draft weights and has to run on the backend - I measured it, and doing it on the host is too slow. But the DSpark drafter lives in common/, which can't reach the draft weights, so once the API is removed there's no way to trigger that computation from common/.

DSpark (DeepSpec, 2026) on top of the merged DFlash drafter. It reuses the DFlash encoder/decoder graph, target feature extraction and KV-cache injection, and the verify/accept path unchanged; the draft model is a new "dspark" arch adding a low-rank Markov head (markov_w1/w2) and an optional (unused here) confidence head. No new public APIs.

The proposal is the only change: the block is anchor-first (position 0 already predicts the first draft) and the decoder graph applies a semi-autoregressive, previous-token conditioned logit bias in-graph, chained per block position:

    logits'(i) = logits(i) + markov_w2 . markov_w1[prev(i)]
    prev(0)    = the block's anchor token
    prev(i>0)  = argmax(logits'(i-1))

vectorized across all blocks in the batch; the anchors are fed through a dedicated graph input (token 0 of every block). Greedy stays lossless (verify unchanged, same as DFlash).

- new arch "dspark" (llama_model_dspark : llama_model_dflash, reuses the graph, loads the markov/confidence tensors; shares the target's embed/lm_head).
- Qwen3DSparkModel converter.
- new spec type "draft-dspark" (common_speculative_impl_draft_dspark : common_speculative_impl_draft_dflash, overrides draft() only: submits whole anchor-first blocks and greedily reads back the biased logits).
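
The chaining order in the recurrence above (sequential over block positions, independent across blocks) can be sketched in plain Python. This is an illustration under assumptions, not the in-graph ggml implementation: the layouts `logits[block][position][vocab]`, `w1[vocab][rank]`, `w2[vocab][rank]` and the function name are hypothetical, and the inner loop over blocks stands in for what a real graph would do as one vectorized op.

```python
def argmax(xs):
    """Index of the largest element (first one on ties)."""
    return max(range(len(xs)), key=xs.__getitem__)

def biased_block_argmax(logits, w1, w2, anchors):
    """Return greedy draft tokens for every block, chaining position by position."""
    n_blocks, block_len = len(logits), len(logits[0])
    prev = list(anchors)                    # prev(0) = each block's anchor token
    drafts = [[] for _ in range(n_blocks)]
    for i in range(block_len):              # chained: position i needs the result of i-1
        for b in range(n_blocks):           # blocks are independent of each other
            h = w1[prev[b]]
            biased = [l + sum(wk * hk for wk, hk in zip(row, h))
                      for l, row in zip(logits[b][i], w2)]
            prev[b] = argmax(biased)        # prev(i+1) = argmax(logits'(i))
            drafts[b].append(prev[b])
    return drafts

# Toy data: vocab of 3, rank 1, two blocks with identical logits but different anchors.
w1 = [[0.0], [1.0], [0.0]]
w2 = [[0.0], [0.0], [2.0]]
block = [[0.0, 1.0, 0.5], [1.0, 0.0, 0.5]]
drafts = biased_block_argmax([block, block], w1, w2, anchors=[0, 1])
print(drafts)
```

The two blocks diverge purely because of their anchors, which is why the anchors need their own graph input.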

wjinxu · 2026-07-02T07:34:19Z

@am17an Thanks for the feedback — you were right about the API. I've reworked the PR:

All llama_dspark_* public APIs are gone. The Markov head is now applied inside the dspark decoder graph (chained per block position, vectorized across blocks), so llama.h / llama-ext.h are untouched and common/speculative.cpp only overrides draft() on top of the DFlash impl, same pattern as before.

I've reviewed and can explain every line of the code myself, and re-verified.

I'd really appreciate another look when you have time.

wjinxu · 2026-07-04T13:30:42Z

@ruixiang63 Could you take another look at this PR when you have time? Thanks!

github-actions Bot added model Model specific conversion labels Jun 30, 2026

wjinxu force-pushed the dspark-upstream branch from f3b83cd to d74ff77 Compare June 30, 2026 14:16

github-actions Bot added the testing Everything test related label Jun 30, 2026

wjinxu force-pushed the dspark-upstream branch from d74ff77 to 37f2513 Compare June 30, 2026 14:39

wjinxu marked this pull request as ready for review June 30, 2026 17:00

wjinxu requested review from a team, CISC, JohannesGaessler and ggerganov as code owners June 30, 2026 17:00

CISC requested a review from ruixiang63 June 30, 2026 17:47

CISC reviewed Jun 30, 2026

Comment thread conversion/qwen.py Outdated

Comment thread conversion/qwen.py Outdated

CISC reviewed Jun 30, 2026

Comment thread conversion/qwen.py Outdated

wjinxu force-pushed the dspark-upstream branch 2 times, most recently from aae2941 to d8b38f2 Compare June 30, 2026 18:36

wjinxu mentioned this pull request Jul 1, 2026

Feature Request: DSpark confidence-scheduled verification & semi-autoregressive drafting #25096

Open

4 tasks

lym000000 reviewed Jul 1, 2026

Comment thread src/llama-model.cpp Outdated

wjinxu force-pushed the dspark-upstream branch from d8b38f2 to 47f3442 Compare July 1, 2026 05:17

wjinxu marked this pull request as draft July 2, 2026 02:24

wjinxu force-pushed the dspark-upstream branch from 47f3442 to 96c5be9 Compare July 2, 2026 07:24

wjinxu marked this pull request as ready for review July 2, 2026 07:34

am17an reviewed Jul 4, 2026

Comment thread common/speculative.cpp

wjinxu added 3 commits July 4, 2026 14:44

spec: read draft block size in the dflash impl

8c548e7

docs: add DSpark section to speculative.md

47932db

spec: keep dspark block size read in the dspark impl

1ba891a

github-actions Bot added the documentation Improvements or additions to documentation label Jul 4, 2026
