[Speculative decoding] feat: add DFlash support #22105
|
Hi @ruixiang63, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below. |
|
I think the method of exposing the hidden states of the target model needs to be cleaner, as it's used in both eagle3 and dflash and I guess even MTP. Probably needs a refactoring to expose these endpoints |
@ggerganov has already worked on this refactoring work. And you’re very welcome to contribute if you have any better ideas for this PR :) |
|
Trying this against Issue 1 (small, easy):
|
|
Rebased onto the latest master. Hybrid target models (e.g. Qwen3.5) now benefit from the speculative checkpointing mechanism recently merged upstream and the DFlash performance gets better. PR description updated with the new performance numbers. |
|
Have you also looked at DDTree perhaps? |
Not yet, but I’ll take a look. I’d expect it to come after this PR gets merged. |
|
Out of curiosity, have you tested quantizing the DFlash model to Q8? https://huggingface.co/lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test |
|
I don’t know if this is useful, but I managed to get it working on AMD (though with poor performance).
Main GPU: R9700 AI PRO running Unsloth Q5 Qwen 27B 3.5 (Vulkan backend) + DFlash bf16 compiled GGUF.
The acceptance rate works well with the current parameters; changing them does affect the rate. The It actually runs. Here’s the command and the result. If you’d like me to test something specific that might help, just let me know. I’m clearly out of my depth and can’t really suggest improvements.
Command + result + evaluation: |
llama.cpp-b8941 does not have this parameter. |
Because it's not merged yet to master branch? |
When is it expected to be merged into master? |
|
getting this startup error: tried these models: meanwhile https://huggingface.co/spiritbuun/Qwen3.6-27B-DFlash-GGUF fails to load on startup: set_dflash: DFlash extraction enabled for layers [0, 0, 0, 0, 0] exec "$LLAMA_SERVER"
|
|
I'm getting the following error trying to run this PR with the Vulkan backend on an R9700, only one token is generated before it crashes:
Full Log |
The DFlash GGUF you referenced is meant for another fork of llama.cpp, not this PR. |
|
My plan of next steps for this PR: This PR currently supports Current working commands for
# llama-cli
./build/bin/llama-cli \
-m "${TARGET_MODEL_GGUF}" \
-md "${DFLASH_MODEL_GGUF}" \
--dflash -p "Write a quicksort algorithm in Python. Write code only." -n 256 --draft-max 16 \
-cd 512 -c 512 \
--temp 0 --top-k 1 --seed 42 -ngl 99 -ngld 99 \
--jinja -rea off
# llama-server
./build/bin/llama-server \
-m "${TARGET_MODEL_GGUF}" \
-md "${DFLASH_MODEL_GGUF}" \
--dflash --draft-max 16 \
-c 2048 -cd 512 \
--temp 0 --top-k 1 --seed 42 \
-ngl 99 -ngld 99 \
--jinja -rea off \
-np 1 \
--host 0.0.0.0 --port 8088 |
|
Why isn’t there any speedup after enabling the DFlash parameter on this branch? Meanwhile, performance drops significantly when I switch to the official parameters. T_T 200 tokens/s as normal: fallback to 40 tokens: |
|
I've tried many different patches and configurations over the weekend for my single 3090 setup. There's no benefit in DFlash that I can see. I cannot reproduce any of the claimed speedups in real workflows with Qwen 27B or Qwen 35B. Originally posted by @aminya in TheTom#103 (comment) |
|
For me it crashes after generating 1 token.
done_getting_tensors: tensor 'token_embd.weight' (q4_K) (and 0 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead
extract_dflash_features: Start to extract DFlash features: 5 layers, 4 tokens, 5120 embd
res send: sending result for task id = 0
data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"Hello"}}],"created":1777351446,"id":"chatcmpl-DbepysCBNfjQ2DyiF5glro4x77VGN159","model":"Qwen3.6-27B-Q4_K_M.gguf","system_fingerprint":"b0-unknown","object":"chat.completion.chunk","timings":{"cache_n":0,"prompt_n":13,"prompt_ms":731.354,"prompt_per_token_ms":56.258,"prompt_per_second":17.77524974225888,"predicted_n":1,"predicted_ms":0.001,"predicted_per_token_ms":0.001,"predicted_per_second":1000000.0}} |
block_size = self.hparams.get("block_size", 16)
self.gguf_writer.add_uint32(f"{self.gguf_writer.arch}.block_size", block_size)
dflash_config = self.hparams.get("dflash_config", {})

target_layer_ids = dflash_config.get("target_layer_ids", [])
if target_layer_ids:
    extract_layer_ids = [i + 1 for i in target_layer_ids]
    self.gguf_writer.add_array(f"{self.gguf_writer.arch}.target_layers", extract_layer_ids)
Add proper keys and methods for these please!
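One possible reading of this request is to give the DFlash metadata named key constants and dedicated writer methods instead of building key strings inline. The sketch below only illustrates that shape: `DFlashKeys`, `SketchWriter`, `add_dflash_block_size`, `add_dflash_target_layers`, and the `"dflash"` architecture string are all hypothetical, and `SketchWriter` is a stand-in that records values in a dict, not the real GGUF writer.

```python
class DFlashKeys:
    # Hypothetical key templates; the real names would be chosen by the maintainers.
    BLOCK_SIZE = "{arch}.block_size"
    TARGET_LAYERS = "{arch}.target_layers"


class SketchWriter:
    """Stand-in for a GGUF writer: stores key/value pairs in a plain dict."""

    def __init__(self, arch):
        self.arch = arch
        self.kv = {}

    def add_dflash_block_size(self, block_size):
        self.kv[DFlashKeys.BLOCK_SIZE.format(arch=self.arch)] = int(block_size)

    def add_dflash_target_layers(self, layer_ids):
        self.kv[DFlashKeys.TARGET_LAYERS.format(arch=self.arch)] = list(layer_ids)


def write_dflash_metadata(writer, hparams):
    # Same logic as the snippet under review, routed through named methods.
    writer.add_dflash_block_size(hparams.get("block_size", 16))
    target_layer_ids = hparams.get("dflash_config", {}).get("target_layer_ids", [])
    if target_layer_ids:
        # The snippet under review shifts every layer ID by one before writing.
        writer.add_dflash_target_layers([i + 1 for i in target_layer_ids])


# Example config values are made up for illustration.
writer = SketchWriter("dflash")
write_dflash_metadata(writer, {"dflash_config": {"target_layer_ids": [1, 9, 17]}})
print(writer.kv)
```

Keeping the key templates in one place would let the converter and the loader agree on names without repeating f-strings at each call site.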
mask_token_id = dflash_config.get("mask_token_id", None)
if mask_token_id is not None:
    self.gguf_writer.add_mask_token_id(mask_token_id)
Suggested change:
- mask_token_id = dflash_config.get("mask_token_id", None)
- if mask_token_id is not None:
-     self.gguf_writer.add_mask_token_id(mask_token_id)
+ mask_token_id = dflash_config.get("mask_token_id", None)
+ if mask_token_id is not None:
+     self.hparams["mask_token_id"] = mask_token_id
I'm not sure of the purpose of separating the token id like this, but this would have gotten overridden by SpecialVocab later on if there already was a mask_token_id in the config.
if not name.startswith("model."):
    name = "model." + name
This belongs in filter_tensors and the two above should have gotten renamed and properly mapped in tensor_mapping.
|
Too late, but follow up please! |
Thanks for the review. Will address these in a follow-up PR. |
|
@ruixiang63 is this CUDA only for now? Using Using Finally, Using Full command:
/git/llama.cpp/build/bin/llama-server \
--log-verbosity 4 \
--model "$MODEL_DIR/Qwen3.6-27B-IQ4_XS.gguf" \
--model-draft "$MODEL_DIR/Qwen3.6-27B-DFlash-IQ4_XS.gguf" \
--spec-type draft-dflash \
--spec-draft-n-max 15 \
--cache-type-k-draft q8_0 \
--cache-type-v-draft q8_0 \
--chat-template-kwargs '{"preserve_thinking": true}' \
--parallel 1 \
--device Vulkan1,CUDA0 \
--no-mmap \
--mlock \
--fit on \
--fit-target 1,640 \
--ctx-size 65536 \
--flash-attn 1 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--threads 12 \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.00 \
--repeat-penalty 1.0 \
--host 0.0.0.0 \
--seed 42 \
--port 51000 |
|
@ruixiang63 could you link me a Q4_K_M quantized draft model I can use, if there is one? Thank you very much for the work ^_^ |
|
because i'm getting with |
The |
|
Hello, I was looking for people comparing MTP performance vs. DFlash on Qwen3.6-27B but found nothing, so commenting here. I am testing on 2 GPUs For my setup, MTP still outperforms DFlash on code generation (my main use-case).
Interestingly, I see that draft acceptance is the same for Q8 and IQ4_XS. I was expecting perf to be better on IQ4_XS, but tg speed is the same. And for people with one GPU only (24GB and 131k context max, thanks to KV cache in Q8), testing on unsloth Q4_K_M with IQ4_XS DFlash draft:
test command (to be adapted for Q4) prompt: NB: for DFlash tests, running the main model in its MTP or non-MTP flavor does not change performance; I guess it just takes more memory for the MTP heads, which are not used anyway. |
|
@gelim Interesting. How many tokens were you trying to predict in both cases? |
MTP run had |
|
Can DFlash and mmproj be used together? Like this: |
* spec: add DFlash v2 support * dflash: support sliding window attention per layer_types * docs: add dflash section --------- Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Yes this is working fine. I tested with adding |
|
@gelim What backend are you using? I'm using Vulkan + 2x 9070XT, and with DFlash VRAM usage goes through the roof: around 5 GB more VRAM, which goes to shared memory |
|
CUDA on Tesla V100 32gb cards |
|
There are still some problems:
llama-server --version
version: 9859 (4fc4ec554)
built with GNU 13.3.0 for Linux x86_64
When I use this script:
llama-server \
-m Qwen3.6-27B-Q4_K_M.gguf \
-md Qwen3.6-27B-DFlash-b16.gguf \
--spec-type draft-dflash \
--spec-draft-n-max 7 \
--temp 0 \
--top-k 1 \
-np 1 \
-c 10240 \
--host 0.0.0.0 \
--port 9090 \
--alias Qwen3.6-27B \
-ngl 99 \
-fa on \
I get this log:
0.00.075.271 I cmn common_param: common_params_print_info: verbosity = 3 (adjust with the `-lv N` CLI arg)
0.00.168.542 I srv load_model: loading model 'Qwen3.6-27B-Q4_K_M.gguf'
0.00.461.531 E llama_init_from_model: failed to initialize the context: dflash requires ctx_other to be set (this warning is normal during memory fitting)
0.00.507.649 W srv load_model: [spec] failed to measure draft model memory: failed to create llama_context from model
0.07.911.111 I srv load_model: initializing, n_slots = 1, n_ctx_slot = 10240, kv_unified = 'false'
0.07.911.142 I common_speculative_impl_draft_dflash: adding speculative implementation 'draft-dflash'
0.07.911.145 I common_speculative_impl_draft_dflash: - n_max=7, n_min=0, p_min=0.00
0.07.911.145 I common_speculative_impl_draft_dflash: - block_size=16, mask_token_id=248070, n_extract=5
0.07.947.129 I srv init: chat template supports preserving reasoning, consider enabling it via --reasoning-preserve
0.07.947.156 I srv llama_server: model loaded
0.07.947.159 I srv llama_server: listening on http://0.0.0.0:9090
0.29.276.782 I slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = -1
0.29.276.839 I slot launch_slot_: id 0 | task 0 | processing task, is_child = 0
0.32.631.781 I slot print_timing: id 0 | task 0 | n_decoded = 243, tg = 80.08 t/s, tg_3s = 80.08 t/s
0.35.657.438 I slot print_timing: id 0 | task 0 | n_decoded = 446, tg = 73.60 t/s, tg_3s = 67.09 t/s
0.38.689.574 I slot print_timing: id 0 | task 0 | n_decoded = 591, tg = 65.00 t/s, tg_3s = 47.82 t/s
0.41.711.604 I slot print_timing: id 0 | task 0 | n_decoded = 715, tg = 59.02 t/s, tg_3s = 41.03 t/s
0.44.753.392 I slot print_timing: id 0 | task 0 | n_decoded = 832, tg = 54.90 t/s, tg_3s = 38.46 t/s
0.47.776.519 I slot print_timing: id 0 | task 0 | n_decoded = 950, tg = 52.26 t/s, tg_3s = 39.03 t/s
0.50.815.488 I slot print_timing: id 0 | task 0 | n_decoded = 1070, tg = 50.43 t/s, tg_3s = 39.49 t/s
0.53.832.416 I slot print_timing: id 0 | task 0 | n_decoded = 1279, tg = 52.77 t/s, tg_3s = 69.28 t/s
0.56.852.714 I slot print_timing: id 0 | task 0 | n_decoded = 1652, tg = 60.61 t/s, tg_3s = 123.50 t/s
0.59.859.107 I slot print_timing: id 0 | task 0 | n_decoded = 1940, tg = 64.11 t/s, tg_3s = 95.80 t/s
1.00.035.253 I slot print_timing: id 0 | task 0 | prompt eval time = 320.58 ms / 18 tokens ( 17.81 ms per token, 56.15 tokens per second)
1.00.035.256 I slot print_timing: id 0 | task 0 | eval time = 30437.80 ms / 1945 tokens ( 15.65 ms per token, 63.90 tokens per second)
1.00.035.257 I slot print_timing: id 0 | task 0 | total time = 30758.38 ms / 1963 tokens
1.00.035.260 I slot print_timing: id 0 | task 0 | graphs reused = 679
1.00.035.264 I slot print_timing: id 0 | task 0 | draft acceptance = 0.26239 ( 1260 accepted / 4802 generated), mean len = 2.84
1.00.035.300 I slot release: id 0 | task 0 | stop processing: n_tokens = 1964, truncated = 0
The acceptance rate is only 0.26239, and an error like this occurred:
E llama_init_from_model: failed to initialize the context: dflash requires ctx_other to be set (this warning is normal during memory fitting)
When I use this script:
llama-server \
-m Qwen3.6-27B-Q4_K_M.gguf \
-mm mmproj-BF16.gguf \
-md Qwen3.6-27B-DFlash-b16.gguf \
--spec-type draft-dflash \
--spec-draft-n-max 7 \
--temp 0 \
--top-k 1 \
-np 1 \
-c 10240 \
--host 0.0.0.0 \
--port 9090 \
--alias Qwen3.6-27B \
-ngl 99 \
-fa on \
I get this log with some errors:
0.00.075.313 I cmn common_param: common_params_print_info: verbosity = 3 (adjust with the `-lv N` CLI arg)
0.00.234.132 I srv load_model: loading model 'Qwen3.6-27B-Q4_K_M.gguf'
0.00.558.158 E llama_init_from_model: failed to initialize the context: dflash requires ctx_other to be set (this warning is normal during memory fitting)
0.00.600.804 W srv load_model: [spec] failed to measure draft model memory: failed to create llama_context from model
0.08.044.149 W load_hparams: Qwen-VL models require at minimum 1024 image tokens to function correctly on grounding tasks
0.08.044.152 W load_hparams: if you encounter problems with accuracy, try adding --image-min-tokens 1024
0.08.044.152 W load_hparams: more info: https://github.com/ggml-org/llama.cpp/issues/16842
0.08.462.134 I srv load_model: loaded multimodal model, 'mmproj-BF16.gguf'
0.08.495.743 I srv load_model: initializing, n_slots = 1, n_ctx_slot = 10240, kv_unified = 'false'
0.08.495.783 I common_speculative_impl_draft_dflash: adding speculative implementation 'draft-dflash'
0.08.495.787 I common_speculative_impl_draft_dflash: - n_max=7, n_min=0, p_min=0.00
0.08.495.787 I common_speculative_impl_draft_dflash: - block_size=16, mask_token_id=248070, n_extract=5
0.08.529.704 I srv init: chat template supports preserving reasoning, consider enabling it via --reasoning-preserve
0.08.529.733 I srv llama_server: model loaded
0.08.529.736 I srv llama_server: listening on http://0.0.0.0:9090
0.52.682.801 I slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = -1
0.52.682.860 I slot launch_slot_: id 0 | task 0 | processing task, is_child = 0
0.53.107.345 W find_slot: non-consecutive token position 8 after 7 for sequence 0 with 512 new tokens
0.53.107.348 W find_slot: non-consecutive token position 8 after 8 for sequence 0 with 512 new tokens
0.53.107.349 W find_slot: non-consecutive token position 8 after 8 for sequence 0 with 68 new tokens
0.53.107.652 W find_slot: non-consecutive token position 8 after 7 for sequence 0 with 512 new tokens
0.53.227.025 W find_slot: non-consecutive token position 8 after 8 for sequence 0 with 512 new tokens
0.53.401.721 W find_slot: non-consecutive token position 8 after 8 for sequence 0 with 68 new tokens
0.53.500.326 W find_slot: non-consecutive token position 50 after 8 for sequence 0 with 4 new tokens
0.53.500.343 W find_slot: non-consecutive token position 50 after 8 for sequence 0 with 4 new tokens
0.53.538.948 E init: the tokens of sequence 0 in the input batch have inconsistent sequence positions:
- the last position stored in the memory module of the context (i.e. the KV cache) for sequence 0 is X = 7
- the tokens for sequence 0 in the input batch have a starting position of Y = 47
it is required that the sequence positions remain consecutive: Y = X + 1
0.53.538.951 E decode: failed to initialize batch
0.53.538.952 E llama_decode: failed to decode, ret = -1
0.53.538.952 E process: llama_decode(ctx_dft) failed rc=-1 (n_tokens=4, offset=0)
0.53.538.955 E srv decode: failed to process speculative batch
0.53.538.978 E srv update_slots: decode() failed: failed to process speculative batch
0.53.538.983 E srv send_error: task id = 0, error: decode() failed: failed to process speculative batch
0.53.538.986 I slot release: id 0 | task 0 | stop processing: n_tokens = 1104, truncated = 0
0.53.539.031 W srv stop: cancel task, id_task = 0
The difference is that I added mmproj-BF16.gguf.
mmproj-BF16.gguf
Qwen3.6-27B-DFlash-b16.gguf
Qwen3.6-27B-DFlash-q8_0.gguf
Qwen3.6-27B-Q4_K_M.gguf
Qwen3.6-27B-UD-Q4_K_XL.gguf
Qwen3.6-27B-UD-Q5_K_XL.gguf
server1.sh
DFlash's GGUFs are from the script of |
|
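For comparing runs like the one above, the `draft acceptance` line can be pulled out of the server log and rechecked. The snippet below is a minimal sketch: the regular expression was written against the single log line quoted in this thread, so the format is an assumption and may differ between builds, and `parse_acceptance` is a hypothetical helper name.

```python
import re

# Line copied from the log in the comment above.
LOG_LINE = "draft acceptance = 0.26239 ( 1260 accepted / 4802 generated), mean len = 2.84"

# Pattern written for this one line only; treat the log format as an assumption.
PATTERN = re.compile(
    r"draft acceptance\s*=\s*([\d.]+)\s*\(\s*(\d+)\s+accepted\s*/\s*(\d+)\s+generated\)"
)


def parse_acceptance(line):
    """Return (reported_rate, accepted, generated), or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return float(match.group(1)), int(match.group(2)), int(match.group(3))


reported, accepted, generated = parse_acceptance(LOG_LINE)
recomputed = accepted / generated
print(f"reported={reported:.5f} recomputed={recomputed:.5f} rejected={generated - accepted}")
```

Here 1260 of 4802 drafted tokens were accepted, so 3542 drafted tokens were computed and then discarded, which is consistent with the throughput dips visible in the `tg_3s` column.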
@nipeone you can try to reproduce my tests using the same GGUF ( the "dflash requires ctx_other " warning disappears if you don't need memory fitting with Side note, if changing the drafter to the F16 version, I get 0.77 draft acceptance and 47 tok/sec for my coding use-case. |
@gelim can single Tesla V100 32gb work well? |
You can judge by the second table I put in a previous comment in this thread. I'm fine with it perf-wise. It could of course always be faster, but for peer coding it's cool. If the question is whether to buy those old cards for local LLM needs, I would not. Their CUDA compute capability is old (7.0). |
Overview
This PR adds DFlash speculative decoding to llama.cpp, achieving up to 8x speedup (Qwen3) with full numerical equivalence to the original reference implementation.
Compared to EAGLE3, which uses an autoregressive draft and generates one token per draft step, DFlash produces an entire block of candidates in a single draft forward pass, resulting in higher per-iteration draft throughput. However, DFlash relies on multiple transformer layers for its draft model, whereas EAGLE3 uses only a single transformer layer.
There is still meaningful headroom for further performance improvements with the current implementation, summarized in the Future Performance Work section below.
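To make the role of the drafted block concrete, here is a minimal sketch of the generic greedy verification step used in speculative decoding: the target scores the whole block in one pass, and the longest matching prefix is committed. This is not code from this PR; `accept_greedy` and its arguments are hypothetical names, and it assumes greedy sampling (as with `--temp 0 --top-k 1`).

```python
def accept_greedy(draft_block, target_greedy):
    """Return the tokens committed after one verify pass under greedy sampling.

    draft_block: tokens proposed by the drafter for the next n positions.
    target_greedy: the target's argmax token at those n positions plus one more,
        computed in a single parallel pass over [last_token] + draft_block.
    """
    if len(target_greedy) != len(draft_block) + 1:
        raise ValueError("target must score every draft position plus one")
    committed = []
    for drafted, wanted in zip(draft_block, target_greedy):
        if drafted != wanted:
            # First mismatch: keep the target's own token and stop.
            committed.append(wanted)
            return committed
        committed.append(drafted)
    # Whole block accepted: the extra target position yields a bonus token.
    committed.append(target_greedy[-1])
    return committed


# Two drafted tokens match, the third does not, so three tokens are committed.
print(accept_greedy([5, 7, 9, 11], [5, 7, 8, 3, 2]))  # [5, 7, 8]
```

Each verify pass commits at least one token, so output matches plain greedy decoding; the speedup depends on how many drafted tokens match per pass.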
How to run DFlash in llama.cpp
Step 1: Convert models to GGUF
[Optional] Step 2: Quantize GGUF models
Step 3: Build llama.cpp
Step 4: Run DFlash speculative decoding
./build/bin/llama-server \
-m "${TARGET_MODEL_GGUF}" \
-md "${DFLASH_MODEL_GGUF}" \
--spec-type draft-dflash \
--spec-draft-n-max 15 \
--temp 0 --top-k 1 \
-np 1 \
-c 40960 --port 8080 -ngl 99 -fa on \
--jinja
Performance Evaluation
Qwen3.6-27B and Qwen3.6-27B-dflash are both Q4_K_M, tested on DGX Spark with SpeedBench with the latest refactoring.
Note
After refactoring, the performance data below may differ from current results, especially since llama-server now supports DFlash as well. However, the data is still useful for getting a general sense of the speedup DFlash provides.
Qwen3-8B
Draft: z-lab/Qwen3-8B-DFlash (bf16), Target: Qwen/Qwen3-8B (bf16)
Qwen3-4B
Draft: z-lab/Qwen3-4B-DFlash (bf16), Target: Qwen/Qwen3-4B (bf16)
GPT-OSS-20B
Draft: z-lab/gpt-oss-20b-DFlash (bf16), Target: openai/gpt-oss-20b (bf16)
For MoE targets (gpt-oss-20b), DFlash speedup is generally smaller than for dense attention targets because more experts get activated during the parallel verification step than during single-token autoregressive decoding (same observation as in #18039 for gpt-oss EAGLE3).
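A rough back-of-the-envelope model shows why verifying a block touches more experts than decoding one token. The sketch assumes each token independently picks `top_k` distinct experts uniformly at random, which real routers do not do, and the sizes used (32 experts, 4 active) are illustrative assumptions, not numbers from this PR.

```python
def expected_active_experts(n_experts, top_k, n_tokens):
    """Expected number of distinct experts touched by n_tokens in one MoE layer.

    Assumes independent, uniform routing per token; treat the result as a
    rough guide rather than a measurement.
    """
    p_unused_by_one_token = 1.0 - top_k / n_experts
    return n_experts * (1.0 - p_unused_by_one_token ** n_tokens)


# Illustrative sizes only (assumed).
N_EXPERTS, TOP_K = 32, 4
for n_tokens in (1, 4, 16):
    active = expected_active_experts(N_EXPERTS, TOP_K, n_tokens)
    print(f"{n_tokens:>2} tokens -> about {active:.1f} experts read per layer")
```

Under these assumptions a single decoded token reads 4 experts per layer, while a 16-token verify batch reads most of them, so the verify pass moves far more expert weights than one autoregressive step.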
Qwen3.5-4B
Draft: z-lab/Qwen3.5-4B-DFlash (bf16), Target: Qwen/Qwen3.5-4B (bf16)
Speedup is intrinsically limited on hybrid target models:
- For hybrid targets (Qwen3.5, ...), when the target verifies draft tokens, llama.cpp writes KV / recurrent state for the full [id_last + draft block] before acceptance is known.
- Pure-attention target models can drop rejected suffixes with seq_rm; hybrid targets cannot, because recurrent state is not decomposable by token position.
- Current workaround in examples/speculative-simple/speculative-simple.cpp:
  - snapshot target state before verify
  - on rejection, restore + replay (rerun target model forward) only the accepted prefix to recover recurrent state
- Cost: each rejected step requires one extra target forward, which is the main reason hybrid speedup lags pure-attention.
Qwen3.5-9B
Draft: z-lab/Qwen3.5-9B-DFlash (bf16), Target: Qwen/Qwen3.5-9B (bf16)
Future Performance Work
KV cache / graph reuse for the DFlash decoder (resolved with K/V cache copy injection)
The DFlash decoder currently rebuilds its graph every iteration (graphs reused = 0). The main cause is that cross.n_enc (the length of accumulated_target_ctx) grows monotonically, which changes the shape of target_ctx and invalidates all downstream tensor shapes.
Possible improvements:
- add a draft-side KV cache to the DFlash decoder.
  This would make the implementation closer to the original reference: committed target-context K/V would be materialized once and reused across iterations, instead of recomputing K/V from the full accumulated context every step. This reduces draft-side compute and also makes graph shapes much more stable, which should improve graph reuse. Since the DFlash decoder attention includes both cross-attention and self-attention, the current llama.cpp implementation does not support this pattern well.
- keep the current no-cache design, but fix the target_ctx input shape.
  Instead of letting target_ctx grow every iteration, reserve a fixed-size buffer, track the active length separately, and mask out the padded region in attention. This preserves the current semantics while allowing the decoder graph to be reused. This method is not ideal compared to using a KV cache.
Hybrid target model performance improvement (For all speculative decoding methods)
Hybrid targets (e.g. Qwen3.5) are slower because the problem is no longer just draft-side graph reuse. During target verify, llama.cpp writes KV / recurrent state for the full draft block before acceptance is known. Pure-attention target models can discard rejected suffixes with seq_rm, but hybrid targets cannot, because their recurrent state is not decomposable by token position.
The current workaround is:
- snapshot the target state before verify
- on rejection, restore the snapshot
- replay only the accepted prefix
This is correct, but each rejected step may require one extra target forward, which is the main reason hybrid speedup lags pure-attention.
A more fundamental future improvement would be target-side deferred commit (SGLang Implementation): verify would compute temporary recurrent states, and only the accepted-prefix state would be committed. That would remove replay from the hybrid path, but it requires deeper changes to llama.cpp’s recurrent-state update flow.
Note this applies to all hybrid models used as target models in speculative decoding methods, not just DFlash.
Updates: Thanks to #19493 and #22227, llama.cpp now supports fallback for hybrid model states.
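The snapshot / restore / replay workaround described in this section can be illustrated with a toy model. This is a sketch under stated assumptions, not llama.cpp code: `ToyHybridTarget` and `verify_with_replay` are hypothetical names, and the "recurrent state" is a single rolling integer that, like a real recurrent state, cannot be trimmed per token.

```python
import copy


class ToyHybridTarget:
    """Toy target whose state folds in every token and cannot be cut by position."""

    def __init__(self):
        self.state = 0       # stands in for the recurrent state
        self.n_forward = 0   # counts target forward passes

    def forward(self, tokens):
        self.n_forward += 1
        for tok in tokens:
            self.state = (self.state * 31 + tok) % 1_000_003

    def snapshot(self):
        # A real state would need an actual copy; deepcopy keeps that explicit.
        return copy.deepcopy(self.state)

    def restore(self, snap):
        self.state = snap


def verify_with_replay(target, block, n_accepted):
    """Snapshot, verify the whole block, then restore and replay on rejection.

    block is [id_last + draft tokens], so n_accepted is at least 1.
    """
    snap = target.snapshot()
    target.forward(block)                    # state now includes rejected tokens
    if n_accepted < len(block):
        target.restore(snap)
        target.forward(block[:n_accepted])   # extra pass over the accepted prefix


target = ToyHybridTarget()
verify_with_replay(target, block=[3, 5, 7, 9], n_accepted=2)
print(target.n_forward)  # 2: one verify pass plus one replay pass
```

A partial rejection costs two target passes in this toy, while full acceptance costs one; a deferred-commit design would aim to keep it at one in both cases.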
More (Low Priority)
Requirements