[Speculative decoding] feat: add DFlash support by ruixiang63 · Pull Request #22105 · ggml-org/llama.cpp

ruixiang63 · 2026-04-19T02:55:02Z

Overview

The old PR has been backed up in branch https://github.com/ruixiang63/llama.cpp/tree/dflash-backup

This PR adds DFlash speculative decoding to llama.cpp, achieving up to an 8x decoding speedup (Qwen3-8B, best case in the tables below) with full numerical equivalence to the reference implementation.

Compared to EAGLE3, which uses an autoregressive draft and generates one token per draft step, DFlash produces an entire block of candidates in a single draft forward pass, resulting in higher per-iteration draft throughput. However, the DFlash draft model uses multiple transformer layers, whereas EAGLE3 uses only a single transformer layer.

There is still meaningful headroom for further performance improvements in the current implementation, summarized in the Future Performance Work section below.
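To make the block-drafting idea concrete, here is a minimal sketch of the standard greedy acceptance rule used in speculative decoding: the target keeps the longest prefix of the drafted block that matches its own greedy choices, then contributes one token of its own. This is an illustration, not the code in this PR. `verify_block`, `toy_target`, and the toy token rule are hypothetical names and assumptions, and a real implementation scores the whole block in one batched target forward pass rather than calling the target token by token.

```python
from typing import Callable, List


def verify_block(target_next: Callable[[List[int]], int],
                 context: List[int],
                 draft_block: List[int]) -> List[int]:
    """Return the tokens committed in one speculative iteration (greedy case)."""
    committed: List[int] = []
    ctx = list(context)
    for tok in draft_block:
        expected = target_next(ctx)  # stands in for one row of a batched target pass
        if tok != expected:
            # First mismatch: keep the target's token and drop the rest of the block.
            committed.append(expected)
            return committed
        committed.append(tok)
        ctx.append(tok)
    # Whole block accepted: the target still yields one extra token.
    committed.append(target_next(ctx))
    return committed


def toy_target(ctx: List[int]) -> int:
    """Hypothetical target model: the next token is (last + 1) mod 10."""
    return (ctx[-1] + 1) % 10


# The draft proposes four tokens; the first two match the target.
print(verify_block(toy_target, [1, 2, 3], [4, 5, 9, 9]))
```

The more tokens of a block the target accepts, the more tokens are committed per target forward pass, which is where the speedup comes from.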

How to run DFlash in llama.cpp

Step 1: Convert models to GGUF

TARGET_MODEL_HF="${MODELS_DIR}/Qwen3-8B"
TARGET_MODEL_GGUF="${MODELS_DIR}/Qwen3-8B.gguf"
DFLASH_MODEL_HF="${MODELS_DIR}/Qwen3-8B-DFlash-b16"
DFLASH_MODEL_GGUF="${MODELS_DIR}/Qwen3-8B-DFlash-b16.gguf"

python convert_hf_to_gguf.py \
    "${TARGET_MODEL_HF}" \
    --outtype bf16 \
    --outfile "${TARGET_MODEL_GGUF}"

python convert_hf_to_gguf.py \
    "${DFLASH_MODEL_HF}" \
    --outtype bf16 \
    --target-model-dir "${TARGET_MODEL_HF}" \
    --outfile "${DFLASH_MODEL_GGUF}"

[Optional] Step 2: Quantize GGUF models

TARGET_MODEL_GGUF="${MODELS_DIR}/Qwen3-8B.gguf"
DFLASH_MODEL_GGUF="${MODELS_DIR}/Qwen3-8B-DFlash-b16.gguf"

./build/bin/llama-quantize \
  ${TARGET_MODEL_GGUF} \
  ${TARGET_MODEL_GGUF}_Q4_K_M.gguf \
  Q4_K_M
 
./build/bin/llama-quantize \
  ${DFLASH_MODEL_GGUF} \
  ${DFLASH_MODEL_GGUF}_Q4_K_M.gguf \
  Q4_K_M

Step 3: Build llama.cpp

cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

Step 4: Run DFlash speculative decoding

./build/bin/llama-server \
    -m  "${TARGET_MODEL_GGUF}" \
    -md "${DFLASH_MODEL_GGUF}" \
    --spec-type draft-dflash \
    --spec-draft-n-max 15 \
    --temp 0 --top-k 1 \
    -np 1 \
    -c 40960 --port 8080 -ngl 99 -fa on \
    --jinja
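Once the server is up, it can be queried like any other llama-server instance. The sketch below builds a request for the OpenAI-compatible `/v1/chat/completions` endpoint using only the Python standard library. The helper names `build_request` and `ask`, the `max_tokens` value, and the assumption that the server listens on `localhost:8080` (as in the command above) are illustrative choices, not part of this PR.

```python
import json
import urllib.request


def build_request(prompt: str, port: int = 8080, max_tokens: int = 256) -> urllib.request.Request:
    """Build a POST request for llama-server's OpenAI-compatible chat endpoint."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,          # greedy, matching --temp 0 on the server side
        "max_tokens": max_tokens,  # illustrative limit
    }
    return urllib.request.Request(
        f"http://localhost:{port}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, port: int = 8080) -> str:
    """Send the prompt and return the assistant's reply (requires a running server)."""
    with urllib.request.urlopen(build_request(prompt, port)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example (only works while llama-server is running):
# print(ask("Write a quicksort algorithm in Python. Write code only."))
```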

Performance Evaluation

The target (Qwen3.6-27B) and the draft (Qwen3.6-27B-dflash) are both quantized to Q4_K_M. Tested on DGX Spark with SpeedBench, using the latest refactored code.

Comparison: baseline=Qwen3.6-27B-baseline.json speculative=Qwen3.6-27B-dflash.json
category       base_avg_pred_t/s  spec_avg_pred_t/s  decode_speedup  base_avg_latency  spec_avg_latency  latency_speedup  accept_rate
-------------  -----------------  -----------------  --------------  ----------------  ----------------  ---------------  -----------
coding         12.63              39.32              3.11x           20.712s           7.444s            2.78x            0.3097     
humanities     12.60              28.31              2.25x           31.056s           14.606s           2.13x            0.1985     
math           12.58              30.45              2.42x           20.495s           8.553s            2.40x            0.2263     
qa             12.57              27.62              2.20x           19.307s           9.914s            1.95x            0.1900     
rag            12.54              51.07              4.07x           31.629s           8.676s            3.65x            0.4422     
reasoning      12.56              30.42              2.42x           20.616s           8.653s            2.38x            0.2263     
stem           12.56              30.37              2.42x           20.487s           8.537s            2.40x            0.2263     
writing        12.52              33.33              2.66x           21.724s           9.231s            2.35x            0.2561     
multilingual   12.56              32.11              2.56x           20.731s           8.685s            2.39x            0.2300     
summarization  12.56              30.35              2.42x           20.732s           8.823s            2.35x            0.2269     
roleplay       12.57              38.02              3.02x           51.993s           18.996s           2.74x            0.2829     
overall        12.57              33.76              2.69x           25.407s           10.193s           2.49x            0.2516
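For readers checking the table, the two speedup columns appear to be simple ratios: decode speedup is speculative throughput divided by baseline throughput, and latency speedup is baseline latency divided by speculative latency. The sketch below recomputes them for three rows copied from the table. The function name `speedups` is hypothetical, and the formulas are inferred from the numbers rather than taken from the benchmark's source.

```python
# (base t/s, spec t/s, base latency s, spec latency s), copied from the table above
rows = {
    "coding":  (12.63, 39.32, 20.712, 7.444),
    "rag":     (12.54, 51.07, 31.629, 8.676),
    "overall": (12.57, 33.76, 25.407, 10.193),
}


def speedups(base_tps: float, spec_tps: float,
             base_lat: float, spec_lat: float) -> tuple:
    """Return (decode_speedup, latency_speedup), rounded to two decimals."""
    return round(spec_tps / base_tps, 2), round(base_lat / spec_lat, 2)


for name, row in rows.items():
    decode, latency = speedups(*row)
    print(f"{name:8s} decode {decode:.2f}x  latency {latency:.2f}x")
```

Latency speedup is lower than decode speedup in every row, which is consistent with latency also including prompt processing that speculative decoding does not accelerate.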

Note

After refactoring, the performance data below may differ from current results, especially since llama-server now supports DFlash as well. However, the data is still useful for getting a general sense of the speedup DFlash provides.

Qwen3-8B

Draft: z-lab/Qwen3-8B-DFlash (bf16), Target: Qwen/Qwen3-8B (bf16)

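| Prompt | block_size | Baseline (t/s) | DFlash w/ thinking (t/s) | Speedup (w/ thinking) | Accept rate (w/ thinking) | DFlash w/o thinking (t/s) | Speedup (w/o thinking) | Accept rate (w/o thinking) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Write a quicksort algorithm in Python. Write code only. | 16 | 51.9 | 92.0 | 1.77x | 12.0% | 419.3 | 8.08x | 93.3% |
| Explain the Pythagorean theorem | 16 | 51.6 | 95.7 | 1.85x | 13.7% | 133.8 | 2.59x | 20.9% |
| Plan a 1 day trip to DC | 16 | 51.6 | 56.5 | 1.09x | 4.9% | 76.7 | 1.49x | 8.9% |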

Qwen3-4B

Draft: z-lab/Qwen3-4B-DFlash (bf16), Target: Qwen/Qwen3-4B (bf16)

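| Prompt | block_size | Baseline (t/s) | DFlash w/ thinking (t/s) | Speedup (w/ thinking) | Accept rate (w/ thinking) | DFlash w/o thinking (t/s) | Speedup (w/o thinking) | Accept rate (w/o thinking) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Write a quicksort algorithm in Python. Write code only. | 16 | 91.0 | 138.9 | 1.53x | 11.3% | 537.9 | 5.91x | 93.3% |
| Explain the Pythagorean theorem | 16 | 91.1 | 130.5 | 1.43x | 11.3% | 187.3 | 2.06x | 18.0% |
| Plan a 1 day trip to DC | 16 | 91.2 | 102.3 | 1.12x | 6.9% | 123.7 | 1.36x | 9.0% |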

GPT-OSS-20B

Draft: z-lab/gpt-oss-20b-DFlash (bf16), Target: openai/gpt-oss-20b (bf16)

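| Prompt | block_size | Baseline (t/s) | DFlash w/ thinking (t/s) | Speedup (w/ thinking) | Accept rate (w/ thinking) | DFlash w/o thinking (t/s) | Speedup (w/o thinking) | Accept rate (w/o thinking) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Write a quicksort algorithm in Python. Write code only. | 8 | 171.0 | 167.9 | 0.98x | 38.0% | 216.8 | 1.27x | 55.6% |
| Explain the Pythagorean theorem | 8 | 172.0 | 147.0 | 0.85x | 31.0% | 178.0 | 1.03x | 42.0% |
| Plan a 1 day trip to DC | 8 | 171.9 | 105.2 | 0.61x | 16.6% | 120.5 | 0.70x | 21.9% |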

For MoE targets (gpt-oss-20b), DFlash speedup is generally smaller than for dense attention targets because more experts get activated during the parallel verification step than during single-token autoregressive decoding (same observation as in #18039 for gpt-oss EAGLE3).
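A rough back-of-the-envelope sketch of why this happens: if each token independently routes to `top_k` of `n_experts` experts, the expected number of distinct experts touched by a batch of `n_tokens` grows quickly with batch size, so the verification pass reads far more expert weights than a single-token decode step. The uniform, independent routing model and the example values (32 experts, top-4) are assumptions for illustration only; real routing is correlated and the actual numbers will differ.

```python
def expected_active_experts(n_experts: int, top_k: int, n_tokens: int) -> float:
    """Expected number of distinct experts hit by n_tokens tokens.

    Assumes every token picks top_k experts uniformly and independently,
    which is a simplification of real MoE routing.
    """
    p_unused = (1.0 - top_k / n_experts) ** n_tokens  # chance one expert is never picked
    return n_experts * (1.0 - p_unused)


# Assumed example configuration: 32 experts, 4 active per token.
for n in (1, 4, 8, 16):
    print(f"{n:2d} tokens -> {expected_active_experts(32, 4, n):5.2f} experts")
```

Under these assumptions a single-token step touches 4 experts while an 8-token verification batch touches roughly 21, so the memory traffic per verification step is several times that of a normal decode step.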

Qwen3.5-4B

Draft: z-lab/Qwen3.5-4B-DFlash (bf16), Target: Qwen/Qwen3.5-4B (bf16)

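| Prompt | block_size | Baseline (t/s) | DFlash w/ thinking (t/s) | Speedup (w/ thinking) | Accept rate (w/ thinking) | DFlash w/o thinking (t/s) | Speedup (w/o thinking) | Accept rate (w/o thinking) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Write a quicksort algorithm in Python. Write code only. | 16 | 82.4 | 131.7 | 1.60x | 36.3% | 293.9 | 3.57x | 85.7% |
| Explain the Pythagorean theorem | 16 | 81.9 | 124.0 | 1.51x | 34.1% | 120.7 | 1.47x | 38.2% |
| Plan a 1 day trip to DC | 16 | 81.3 | 102.1 | 1.26x | 26.7% | 75.5 | 0.93x | 17.7% |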

Speedup is intrinsically limited on hybrid target models:

~~For Hybrid targets (Qwen3.5, ...), when target verify draft tokens, llama.cpp writes KV / recurrent state for the full [id_last + draft block] before acceptance is known.~~
~~Pure-attention target models can drop rejected suffixes with seq_rm; hybrid targets cannot, because recurrent state is not decomposable by token position.~~
~~Current workaround in examples/speculative-simple/speculative-simple.cpp:~~
- ~~snapshot target state before verify~~
- ~~on rejection, restore + replay(rerun target model forward) only the accepted prefix to recover recurrent state~~
~~Cost: each rejected step requires one extra target forward, which is the main reason hybrid speedup lags pure-attention.~~
Thanks to speculative checkpointing (server: speculative checkpointing #19493 and speculative-simple: add checkpoint support #22227), llama.cpp now supports fallback for hybrid model states.

Qwen3.5-9B

Draft: z-lab/Qwen3.5-9B-DFlash (bf16), Target: Qwen/Qwen3.5-9B (bf16)

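| Prompt | block_size | Baseline (t/s) | DFlash w/ thinking (t/s) | Speedup (w/ thinking) | Accept rate (w/ thinking) | DFlash w/o thinking (t/s) | Speedup (w/o thinking) | Accept rate (w/o thinking) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Write a quicksort algorithm in Python. Write code only. | 16 | 47.6 | 64.0 | 1.34x | 30.5% | 131.8 | 2.77x | 55.2% |
| Explain the Pythagorean theorem | 16 | 48.0 | 91.2 | 1.90x | 44.3% | 88.6 | 1.85x | 35.1% |
| Plan a 1 day trip to DC | 16 | 48.0 | 69.6 | 1.45x | 25.7% | 52.5 | 1.10x | 17.7% |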

Future Performance Work

~~### KV cache / graph reuse for the DFlash decoder~~ (resolved with K/V cache copy injection)
The DFlash decoder currently rebuilds its graph every iteration (graphs reused = 0). The main cause is that cross.n_enc (the length of accumulated_target_ctx) grows monotonically, which changes the shape of target_ctx and invalidates all downstream tensor shapes.

~~Possible improvements:~~

- add a draft-side KV cache to the DFlash decoder.
This would make the implementation closer to the original reference: committed target-context K/V would be materialized once and reused across iterations, instead of recomputing K/V from the full accumulated context every step. This reduces draft-side compute and also makes graph shapes much more stable, which should improve graph reuse. Since the DFlash decoder attention includes both cross-attention and self-attention, the current llama.cpp implementation does not support this pattern well.

- keep the current no-cache design, but fix the target_ctx input shape.
Instead of letting target_ctx grow every iteration, reserve a fixed-size buffer, track the active length separately, and mask out the padded region in attention. This preserves the current semantics while allowing the decoder graph to be reused. This method is not ideal compared to using a KV cache.
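The fixed-shape option can be illustrated with a small pure-Python sketch: the context buffer always has the same capacity, the active length is tracked separately, and padded slots receive a score of negative infinity so they contribute nothing after the softmax. This is a toy single-query attention, not ggml code. `masked_attention`, the buffer contents, and the capacity of 4 are hypothetical.

```python
import math
from typing import List


def masked_attention(query: List[float], keys: List[List[float]],
                     values: List[List[float]], n_active: int) -> List[float]:
    """Single-query attention over a fixed-capacity buffer.

    Slots with index >= n_active are padding and are masked out.
    """
    assert 1 <= n_active <= len(keys)
    scale = 1.0 / math.sqrt(len(query))
    scores = [
        sum(q * k for q, k in zip(query, key)) * scale if i < n_active else float("-inf")
        for i, key in enumerate(keys)
    ]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]  # exp(-inf) == 0.0 for padded slots
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


# Capacity is fixed at 4; only the first 2 slots hold real context.
keys = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0], [9.0, 9.0]]
values = [[1.0], [3.0], [100.0], [100.0]]
padded = masked_attention([1.0, 1.0], keys, values, n_active=2)
exact = masked_attention([1.0, 1.0], keys[:2], values[:2], n_active=2)
print(padded, exact)
```

Because the padded slots carry zero weight, the result matches attention over the unpadded context, while the tensor shapes stay constant from one iteration to the next.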

Hybrid target model performance improvement (For all speculative decoding methods)

Hybrid targets (e.g. Qwen3.5) are slower because the problem is no longer just draft-side graph reuse. During target verify, llama.cpp writes KV / recurrent state for the full draft block before acceptance is known. Pure-attention target models can discard rejected suffixes with seq_rm, but hybrid targets cannot, because their recurrent state is not decomposable by token position.

~~The current workaround is:~~

~~snapshot the target state before verify~~
~~on rejection, restore the snapshot~~
~~replay only the accepted prefix~~

~~This is correct, but each rejected step may require one extra target forward, which is the main reason hybrid speedup lags pure-attention.~~
A more fundamental future improvement would be target-side deferred commit (SGLang Implementation): verify would compute temporary recurrent states, and only the accepted-prefix state would be committed. That would remove replay from the hybrid path, but it requires deeper changes to llama.cpp’s recurrent-state update flow.
Note this applies to all hybrid models used as target models in speculative decoding methods, not just DFlash.
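The difference between snapshot-and-replay and deferred commit can be sketched with a toy recurrence. The state update below is a stand-in for a recurrent layer. `step`, `verify_with_replay`, and `verify_deferred` are hypothetical names, and neither function reflects the actual llama.cpp or SGLang code.

```python
from typing import List


def step(state: int, token: int) -> int:
    """Toy recurrent update: the new state depends on the entire history."""
    return (state * 31 + token) % 1_000_003


def verify_with_replay(state: int, tokens: List[int], n_accepted: int) -> int:
    """Snapshot, run the full block, then restore and replay the accepted prefix."""
    snapshot = state
    for tok in tokens:
        state = step(state, tok)      # state now also contains rejected tokens
    if n_accepted < len(tokens):
        state = snapshot              # restore
        for tok in tokens[:n_accepted]:
            state = step(state, tok)  # replay: an extra pass over the prefix
    return state


def verify_deferred(state: int, tokens: List[int], n_accepted: int) -> int:
    """Keep the temporary state after each token and commit only the accepted one."""
    states = [state]
    for tok in tokens:
        states.append(step(states[-1], tok))
    return states[n_accepted]


block = [7, 8, 9, 10]
print(verify_with_replay(42, block, 2), verify_deferred(42, block, 2))
```

Both paths end in the same state. The deferred version avoids the second pass at the cost of holding one temporary state per verified position.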

Updates: Thanks to #19493 and #22227, llama.cpp now supports fallback for hybrid model states.

More (Low Priority)

Draft-side sampling fast path: For greedy / no-grammar mode, batch argmax over the entire drafted block instead of invoking the sampler one token at a time.
CUDA graph for both draft model and target model
....
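As a small illustration of the draft-side sampling fast path, greedy selection over a whole drafted block reduces to one argmax per row of the logits matrix, with no sampler state involved. The sketch below uses plain Python lists. `argmax_block` and the example logits are hypothetical, and a real implementation would do this in a single batched kernel.

```python
from typing import List


def argmax_block(logits_block: List[List[float]]) -> List[int]:
    """Return the greedy token id for every position in the drafted block."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits_block]


# Two drafted positions over a three-token vocabulary.
print(argmax_block([[0.1, 0.7, 0.2], [2.0, -1.0, 0.5]]))
```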

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: Yes, I used Claude to help discuss and design the DFlash architecture, ask clarifying questions, and assist with writing tests. Everything remains under my control, and I reviewed every single line of AI-generated code.

ggml-gh-bot · 2026-04-19T02:58:33Z

Hi @ruixiang63, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

Large PR: Large changes require prior discussion (e.g. an issue or RFC) and maintainers may not be able to review this PR as-is. Consider splitting it into smaller, focused PRs.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

am17an · 2026-04-19T03:28:27Z

I think the method of exposing the hidden states of the target model needs to be cleaner, as it's used in both eagle3 and dflash and I guess even MTP. Probably needs a refactoring to expose these endpoints

ruixiang63 · 2026-04-19T03:31:06Z

> I think the method of exposing the hidden states of the target model needs to be cleaner, as it's used in both eagle3 and dflash and I guess even MTP. Probably needs a refactoring to expose these endpoints

@ggerganov has already worked on this refactoring. And you’re very welcome to contribute if you have any better ideas for this PR :)

noonghunna · 2026-04-20T16:42:39Z

Trying this against RedHatAI/gemma-4-31B-it-speculator.dflash on Ampere (2× RTX 3090, sm_86, CUDA 12.9) — ran into a gap worth flagging for speculators-format drafts.

Issue 1 (small, easy): `d2t` / `t2d` not handled in DFlashModel

The EAGLE3 path at convert_hf_to_gguf.py lines ~2923-2931 stashes d2t as int64 and drops t2d. DFlashModel (subclass of Qwen3Model) doesn't replicate that — fails with Can not map tensor 'model.d2t' on any speculators-format draft.

Fix that worked locally for us:

```python
# in DFlashModel.modify_tensors, before the super() fallthrough
if name == "d2t":
    # stash d2t so the EAGLE3 int-tensor emit in prepare_tensors picks it up
    if not hasattr(self, "_eagle3_int_tensors"):
        self._eagle3_int_tensors = {}
    self._eagle3_int_tensors[name] = data_torch
    if not hasattr(self, "is_eagle3"):
        self.is_eagle3 = True
    return
if name == "t2d":
    # t2d is dropped, as in the EAGLE3 path
    return
```

Piggy-backs on the EAGLE3 int-tensor emit in prepare_tensors.

Issue 2 (bigger): `gguf.MODEL_ARCH.DFLASH` tensor list is missing `TOKEN_EMBD` and `OUTPUT`

`gguf-py/gguf/constants.py` line 3578+ registers `DFLASH_FC` + `DFLASH_HIDDEN_NORM` plus the transformer layers, but not `TOKEN_EMBD` / `OUTPUT`. Works for z-lab-format drafts (they share the target's full vocab — we confirmed `z-lab/Qwen3.6-35B-A3B-DFlash` converts cleanly), but speculators-format drafts carry their own reduced-vocab embeddings — Red Hat's Gemma-4 draft has `draft_vocab_size=32000` vs Gemma's ~300K, with its own `embed_tokens.weight` + `lm_head.weight` + `d2t`/`t2d` remap.

After patching Issue 1, conversion now fails at `Can not map tensor 'model.embed_tokens.weight'`.

Fixing this end-to-end isn't just adding the arch constants — inference also needs to use the draft's own embeddings plus the `d2t` table to map draft-vocab logits back to target-vocab ids during verify. Flagging here rather than filing a separate issue since it's narrow.
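To illustrate the remap step, here is a minimal sketch of translating draft-vocab ids into target-vocab ids. It assumes `d2t` stores a per-draft-id offset, so that `target_id = draft_id + d2t[draft_id]`. That convention is an assumption on our part and should be checked against the draft's actual tensors. `draft_to_target_ids` and the tiny table are hypothetical.

```python
from typing import List


def draft_to_target_ids(draft_ids: List[int], d2t: List[int]) -> List[int]:
    """Map draft-vocab token ids to target-vocab ids.

    Assumes d2t[i] is an offset to add to draft id i (unverified convention).
    """
    return [i + d2t[i] for i in draft_ids]


# Toy table: draft ids 0, 1, 2 correspond to target ids 0, 5, 12.
d2t = [0, 4, 10]
print(draft_to_target_ids([2, 0, 1], d2t))
```

The mapped ids are what the target would compare against its own predictions during verify.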

AI-assisted (Claude) findings, human review & submit.

ruixiang63 · 2026-04-24T18:37:40Z

Rebased onto the latest master. Hybrid target models (e.g. Qwen3.5) now benefit from the speculative checkpointing mechanism recently merged upstream, and DFlash performance improves as a result. PR description updated with the new performance numbers.

ivanbaldo · 2026-04-24T18:44:21Z

Have you also looked at DDTree perhaps?

ruixiang63 · 2026-04-24T18:50:39Z

> Have you also looked at DDTree perhaps?

Not yet, but I’ll take a look. I’d expect it to come after this PR gets merged.

kroaton · 2026-04-24T23:10:16Z

Out of curiosity, have you tested quantizing the DFlash model to Q8?
I found this by mistake yesterday https://huggingface.co/spiritbuun/Qwen3.6-27B-DFlash-GGUF, maybe worth a shot at supporting .gguf for DFlash instead of just .safetensors?

https://huggingface.co/lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test

JonhJonhD · 2026-04-26T16:21:01Z

I don’t know if this is useful, but I managed to get it working on AMD (though with poor performance)

Main GPU: R9700 AI PRO running Unsloth Q5 Qwen 27B 3.5 (Vulkan backend) + DFlash bf16 compiled GGUF

The acceptance rate works well with the current parameters; changing them does affect the rate.

The --device Vulkan0,Cuda0 and --tensor-split 1,0 parameters are necessary to make things run; otherwise it fails, because the GGML scheduler currently does not handle cross-buffer / cross-device operations cleanly for this feature.

It actually runs. Here’s the command and the result. If you’d like me to test something specific that might help, just let me know. I’m clearly out of my depth and can’t really suggest improvements.

Command + result + evaluation :


 ./llama-speculative-simple \
  -m ~/models/qwen3.5/Qwen3.5-27B-Q5_K_M.gguf \
  --model-draft ~/models/qwen3.5/Qwen3.5-27BDFlash.gguf \
  --device Vulkan0,Cuda0 \
  --tensor-split 1,0 \
  -cd 8192 \
  --draft-max 2 \
   --temp 0.8 \
   --top-p 0.9 \
   --top-k 40 \
   --presence_penalty 1.2 \
  -ngl 99 \
  --ctx-size 1572 \
  --flash-attn on \
  -ub 256 \
  -b 1024 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --dflash
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 15949 MiB):
  Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes, VRAM: 15949 MiB
WARNING: radv is not a conformant Vulkan implementation, testing use only.
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
common_params_fit_impl: getting device memory data for initial parameters:
common_memory_breakdown_print: | memory breakdown [MiB]                    | total    free     self   model   context   compute       unaccounted |
common_memory_breakdown_print: |   - Vulkan0 (AI PRO R9700 (RADV GFX1201)) | 32768 = 31424 + (18341 = 17856 +     209 +     275) + 17592186027418 |
common_memory_breakdown_print: |   - CUDA0 (RTX 4060 Ti)                   | 15949 = 15809 + (    0 =     0 +       0 +       0) +            140 |
common_memory_breakdown_print: |   - Host                                  |                    851 =   833 +       0 +      18                   |
common_params_fit_impl: projected memory use with initial parameters [MiB]:
common_params_fit_impl:   - Vulkan0 (AMD Radeon AI PRO R9700 (RADV GFX1201)):  32768 total,  18341 used,  13083 free vs. target of   1024
common_params_fit_impl:   - CUDA0 (NVIDIA GeForce RTX 4060 Ti)              :  15949 total,      0 used,  15809 free vs. target of   1024
common_params_fit_impl: projected to use 18341 MiB of device memory vs. 47233 MiB of free device memory
common_params_fit_impl: targets for free memory can be met on all devices, no changes needed
common_fit_params: successfully fit params to free device memory
common_fit_params: fitting params to free memory took 0.43 seconds
llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon AI PRO R9700 (RADV GFX1201)) (0000:08:00.0) - 31726 MiB free
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4060 Ti) (0000:04:00.0) - 15809 MiB free
llama_model_loader: loaded meta data with 49 key-value pairs and 851 tensors from ~/models/qwen3.5/Qwen3.5-27B-Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen35
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                     general.sampling.top_k i32              = 20
llama_model_loader: - kv   3:                     general.sampling.top_p f32              = 0.950000
llama_model_loader: - kv   4:                      general.sampling.temp f32              = 0.600000
llama_model_loader: - kv   5:                               general.name str              = Qwen3.5-27B
llama_model_loader: - kv   6:                           general.basename str              = Qwen3.5-27B
llama_model_loader: - kv   7:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   8:                         general.size_label str              = 27B
llama_model_loader: - kv   9:                            general.license str              = apache-2.0
llama_model_loader: - kv  10:                       general.license.link str              = https://huggingface.co/Qwen/Qwen3.5-2...
llama_model_loader: - kv  11:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv  12:                   general.base_model.count u32              = 1
llama_model_loader: - kv  13:                  general.base_model.0.name str              = Qwen3.5 27B
llama_model_loader: - kv  14:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  15:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3.5-27B
llama_model_loader: - kv  16:                               general.tags arr[str,3]       = ["qwen3_5_moe", "unsloth", "image-tex...
llama_model_loader: - kv  17:                         qwen35.block_count u32              = 64
llama_model_loader: - kv  18:                      qwen35.context_length u32              = 262144
llama_model_loader: - kv  19:                    qwen35.embedding_length u32              = 5120
llama_model_loader: - kv  20:                 qwen35.feed_forward_length u32              = 17408
llama_model_loader: - kv  21:                qwen35.attention.head_count u32              = 24
llama_model_loader: - kv  22:             qwen35.attention.head_count_kv u32              = 4
llama_model_loader: - kv  23:             qwen35.rope.dimension_sections arr[i32,4]       = [11, 11, 10, 0]
llama_model_loader: - kv  24:                      qwen35.rope.freq_base f32              = 10000000.000000
llama_model_loader: - kv  25:    qwen35.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  26:                qwen35.attention.key_length u32              = 256
llama_model_loader: - kv  27:              qwen35.attention.value_length u32              = 256
llama_model_loader: - kv  28:                     qwen35.ssm.conv_kernel u32              = 4
llama_model_loader: - kv  29:                      qwen35.ssm.state_size u32              = 128
llama_model_loader: - kv  30:                     qwen35.ssm.group_count u32              = 16
llama_model_loader: - kv  31:                  qwen35.ssm.time_step_rank u32              = 48
llama_model_loader: - kv  32:                      qwen35.ssm.inner_size u32              = 6144
llama_model_loader: - kv  33:             qwen35.full_attention_interval u32              = 4
llama_model_loader: - kv  34:                qwen35.rope.dimension_count u32              = 64
llama_model_loader: - kv  35:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  36:                         tokenizer.ggml.pre str              = qwen35
llama_model_loader: - kv  37:                      tokenizer.ggml.tokens arr[str,248320]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  38:                  tokenizer.ggml.token_type arr[i32,248320]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  39:                      tokenizer.ggml.merges arr[str,247587]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  40:                tokenizer.ggml.eos_token_id u32              = 248046
llama_model_loader: - kv  41:            tokenizer.ggml.padding_token_id u32              = 248055
llama_model_loader: - kv  42:                    tokenizer.chat_template str              = {%- set image_count = namespace(value...
llama_model_loader: - kv  43:               general.quantization_version u32              = 2
llama_model_loader: - kv  44:                          general.file_type u32              = 17
llama_model_loader: - kv  45:                      quantize.imatrix.file str              = Qwen3.5-27B-GGUF/imatrix_unsloth.gguf
llama_model_loader: - kv  46:                   quantize.imatrix.dataset str              = unsloth_calibration_Qwen3.5-27B.txt
llama_model_loader: - kv  47:             quantize.imatrix.entries_count u32              = 496
llama_model_loader: - kv  48:              quantize.imatrix.chunks_count u32              = 80
llama_model_loader: - type  f32:  353 tensors
llama_model_loader: - type q8_0:   96 tensors
llama_model_loader: - type q5_K:  263 tensors
llama_model_loader: - type q6_K:  139 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q5_K - Medium
print_info: file size   = 18.25 GiB (5.83 BPW) 
load: 0 unused tokens
load: printing all EOG tokens:
load:   - 248044 ('<|endoftext|>')
load:   - 248046 ('<|im_end|>')
load:   - 248063 ('<|fim_pad|>')
load:   - 248064 ('<|repo_name|>')
load:   - 248065 ('<|file_sep|>')
load: special tokens cache size = 33
load: token to piece cache size = 1.7581 MB
print_info: arch                  = qwen35
print_info: vocab_only            = 0
print_info: no_alloc              = 0
print_info: n_ctx_train           = 262144
print_info: n_embd                = 5120
print_info: n_embd_inp            = 5120
print_info: n_layer               = 64
print_info: n_head                = 24
print_info: n_head_kv             = 4
print_info: n_rot                 = 64
print_info: n_swa                 = 0
print_info: is_swa_any            = 0
print_info: n_embd_head_k         = 256
print_info: n_embd_head_v         = 256
print_info: n_gqa                 = 6
print_info: n_embd_k_gqa          = 1024
print_info: n_embd_v_gqa          = 1024
print_info: f_norm_eps            = 0.0e+00
print_info: f_norm_rms_eps        = 1.0e-06
print_info: f_clamp_kqv           = 0.0e+00
print_info: f_max_alibi_bias      = 0.0e+00
print_info: f_logit_scale         = 0.0e+00
print_info: f_attn_scale          = 0.0e+00
print_info: n_ff                  = 17408
print_info: n_expert              = 0
print_info: n_expert_used         = 0
print_info: n_expert_groups       = 0
print_info: n_group_used          = 0
print_info: causal attn           = 1
print_info: pooling type          = -1
print_info: rope type             = 40
print_info: rope scaling          = linear
print_info: freq_base_train       = 10000000.0
print_info: freq_scale_train      = 1
print_info: n_ctx_orig_yarn       = 262144
print_info: rope_yarn_log_mul     = 0.0000
print_info: rope_finetuned        = unknown
print_info: mrope sections        = [11, 11, 10, 0]
print_info: ssm_d_conv            = 4
print_info: ssm_d_inner           = 6144
print_info: ssm_d_state           = 128
print_info: ssm_dt_rank           = 48
print_info: ssm_n_group           = 16
print_info: ssm_dt_b_c_rms        = 0
print_info: model type            = 27B
print_info: model params          = 26.90 B
print_info: general.name          = Qwen3.5-27B
print_info: vocab type            = BPE
print_info: n_vocab               = 248320
print_info: n_merges              = 247587
print_info: BOS token             = 11 ','
print_info: EOS token             = 248046 '<|im_end|>'
print_info: EOT token             = 248046 '<|im_end|>'
print_info: PAD token             = 248055 '<|vision_pad|>'
print_info: LF token              = 198 'Ċ'
print_info: FIM PRE token         = 248060 '<|fim_prefix|>'
print_info: FIM SUF token         = 248062 '<|fim_suffix|>'
print_info: FIM MID token         = 248061 '<|fim_middle|>'
print_info: FIM PAD token         = 248063 '<|fim_pad|>'
print_info: FIM REP token         = 248064 '<|repo_name|>'
print_info: FIM SEP token         = 248065 '<|file_sep|>'
print_info: EOG token             = 248044 '<|endoftext|>'
print_info: EOG token             = 248046 '<|im_end|>'
print_info: EOG token             = 248063 '<|fim_pad|>'
print_info: EOG token             = 248064 '<|repo_name|>'
print_info: EOG token             = 248065 '<|file_sep|>'
print_info: max token length      = 256
load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)
load_tensors: offloading output layer to GPU
load_tensors: offloading 63 repeating layers to GPU
load_tensors: offloaded 65/65 layers to GPU
load_tensors:   CPU_Mapped model buffer size =   833.59 MiB
load_tensors:      Vulkan0 model buffer size = 17856.52 MiB
.............................................................................................
common_init_result: added <|endoftext|> logit bias = -inf
common_init_result: added <|im_end|> logit bias = -inf
common_init_result: added <|fim_pad|> logit bias = -inf
common_init_result: added <|repo_name|> logit bias = -inf
common_init_result: added <|file_sep|> logit bias = -inf
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 1792
llama_context: n_ctx_seq     = 1792
llama_context: n_batch       = 1024
llama_context: n_ubatch      = 256
llama_context: causal_attn   = 1
llama_context: flash_attn    = enabled
llama_context: kv_unified    = false
llama_context: freq_base     = 10000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (1792) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context: Vulkan_Host  output buffer size =     0.95 MiB
llama_kv_cache:    Vulkan0 KV buffer size =    59.50 MiB
llama_kv_cache: size =   59.50 MiB (  1792 cells,  16 layers,  1/1 seqs), K (q8_0):   29.75 MiB, V (q8_0):   29.75 MiB
llama_kv_cache: attn_rot_k = 1, n_embd_head_k_all = 256
llama_kv_cache: attn_rot_v = 1, n_embd_head_k_all = 256
llama_memory_recurrent:    Vulkan0 RS buffer size =   149.62 MiB
llama_memory_recurrent: size =  149.62 MiB (     1 cells,  64 layers,  1 seqs), R (f32):    5.62 MiB, S (f32):  144.00 MiB
llama_context: pipeline parallelism enabled
sched_reserve: reserving ...
sched_reserve: resolving fused Gated Delta Net support:
sched_reserve: fused Gated Delta Net (autoregressive) enabled
sched_reserve: fused Gated Delta Net (chunked) enabled
sched_reserve:    Vulkan0 compute buffer size =   275.60 MiB
sched_reserve: Vulkan_Host compute buffer size =    18.14 MiB
sched_reserve: graph nodes  = 3849
sched_reserve: graph splits = 2
sched_reserve: reserve took 23.36 ms, sched copies = 4
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
common_context_can_seq_rm: the target context does not support partial sequence removal
speculative decoding will use checkpoints (context does not support partial sequence removal)
llama_model_load_from_file_impl: skipping device Vulkan1 (NVIDIA GeForce RTX 4060 Ti) with id 0000:04:00.0 - already using device CUDA0 (NVIDIA GeForce RTX 4060 Ti) with the same id
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4060 Ti) (0000:04:00.0) - 15809 MiB free
llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon AI PRO R9700 (RADV GFX1201)) (0000:08:00.0) - 13380 MiB free
llama_model_loader: loaded meta data with 30 key-value pairs and 58 tensors from ~/models/qwen3.5/Qwen3.5-27BDFlash.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = dflash
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3.5 DFlash
llama_model_loader: - kv   3:                         general.size_label str              = 1.7B
llama_model_loader: - kv   4:                            general.license str              = mit
llama_model_loader: - kv   5:                               general.tags arr[str,8]       = ["dflash", "speculative-decoding", "d...
llama_model_loader: - kv   6:                         dflash.block_count u32              = 5
llama_model_loader: - kv   7:                      dflash.context_length u32              = 262144
llama_model_loader: - kv   8:                    dflash.embedding_length u32              = 5120
llama_model_loader: - kv   9:                 dflash.feed_forward_length u32              = 17408
llama_model_loader: - kv  10:                dflash.attention.head_count u32              = 32
llama_model_loader: - kv  11:             dflash.attention.head_count_kv u32              = 8
llama_model_loader: - kv  12:                      dflash.rope.freq_base f32              = 10000000.000000
llama_model_loader: - kv  13:    dflash.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:                dflash.attention.key_length u32              = 128
llama_model_loader: - kv  15:              dflash.attention.value_length u32              = 128
llama_model_loader: - kv  16:                          general.file_type u32              = 32
llama_model_loader: - kv  17:                          dflash.block_size u32              = 16
llama_model_loader: - kv  18:                    dflash.target_layer_ids arr[i32,5]       = [2, 17, 32, 47, 62]
llama_model_loader: - kv  19:                       dflash.mask_token_id u32              = 248070
llama_model_loader: - kv  20:               general.quantization_version u32              = 2
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = qwen35
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,248320]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,248320]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,247587]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 248046
llama_model_loader: - kv  27:            tokenizer.ggml.padding_token_id u32              = 248044
llama_model_loader: - kv  28:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  29:                    tokenizer.chat_template str              = {%- set image_count = namespace(value...
llama_model_loader: - type  f32:   22 tensors
llama_model_loader: - type bf16:   36 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = BF16
print_info: file size   = 3.22 GiB (16.00 BPW) 
load_hparams: DFlash extract_layers = [2, 17, 32, 47, 62]
load_hparams: DFlash block_size = 16, mask_token_id = 248070
load: 0 unused tokens
load: printing all EOG tokens:
load:   - 248044 ('<|endoftext|>')
load:   - 248046 ('<|im_end|>')
load:   - 248063 ('<|fim_pad|>')
load:   - 248064 ('<|repo_name|>')
load:   - 248065 ('<|file_sep|>')
load: special tokens cache size = 33
load: token to piece cache size = 1.7581 MB
print_info: arch                  = dflash
print_info: vocab_only            = 0
print_info: no_alloc              = 0
print_info: n_ctx_train           = 262144
print_info: n_embd                = 5120
print_info: n_embd_inp            = 5120
print_info: n_layer               = 5
print_info: n_head                = 32
print_info: n_head_kv             = 8
print_info: n_rot                 = 128
print_info: n_swa                 = 0
print_info: is_swa_any            = 0
print_info: n_embd_head_k         = 128
print_info: n_embd_head_v         = 128
print_info: n_gqa                 = 4
print_info: n_embd_k_gqa          = 1024
print_info: n_embd_v_gqa          = 1024
print_info: f_norm_eps            = 0.0e+00
print_info: f_norm_rms_eps        = 1.0e-06
print_info: f_clamp_kqv           = 0.0e+00
print_info: f_max_alibi_bias      = 0.0e+00
print_info: f_logit_scale         = 0.0e+00
print_info: f_attn_scale          = 0.0e+00
print_info: n_ff                  = 17408
print_info: n_expert              = 0
print_info: n_expert_used         = 0
print_info: n_expert_groups       = 0
print_info: n_group_used          = 0
print_info: causal attn           = 1
print_info: pooling type          = -1
print_info: rope type             = 2
print_info: rope scaling          = linear
print_info: freq_base_train       = 10000000.0
print_info: freq_scale_train      = 1
print_info: n_ctx_orig_yarn       = 262144
print_info: rope_yarn_log_mul     = 0.0000
print_info: rope_finetuned        = unknown
print_info: model type            = ?B
print_info: model params          = 1.73 B
print_info: general.name          = Qwen3.5 DFlash
print_info: vocab type            = BPE
print_info: n_vocab               = 248320
print_info: n_merges              = 247587
print_info: BOS token             = 11 ','
print_info: EOS token             = 248046 '<|im_end|>'
print_info: EOT token             = 248046 '<|im_end|>'
print_info: PAD token             = 248044 '<|endoftext|>'
print_info: LF token              = 198 'Ċ'
print_info: FIM PRE token         = 248060 '<|fim_prefix|>'
print_info: FIM SUF token         = 248062 '<|fim_suffix|>'
print_info: FIM MID token         = 248061 '<|fim_middle|>'
print_info: FIM PAD token         = 248063 '<|fim_pad|>'
print_info: FIM REP token         = 248064 '<|repo_name|>'
print_info: FIM SEP token         = 248065 '<|file_sep|>'
print_info: EOG token             = 248044 '<|endoftext|>'
print_info: EOG token             = 248046 '<|im_end|>'
print_info: EOG token             = 248063 '<|fim_pad|>'
print_info: EOG token             = 248064 '<|repo_name|>'
print_info: EOG token             = 248065 '<|file_sep|>'
print_info: max token length      = 256
load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)
load_tensors: offloading output layer to GPU
load_tensors: offloading 4 repeating layers to GPU
load_tensors: offloaded 6/6 layers to GPU
load_tensors:        CUDA0 model buffer size =  3300.24 MiB
.............................
set_dflash: DFlash extraction enabled for layers [2, 17, 32, 47, 62]
main: DFlash chat template applied


<|im_start|>user
<|im_end|>
<|im_start|>assistant
<think>
sched_reserve: reserving ...
sched_reserve:    Vulkan0 compute buffer size =   417.94 MiB
sched_reserve: Vulkan_Host compute buffer size =    18.14 MiB
sched_reserve: graph nodes  = 3849
sched_reserve: graph splits = 2
sched_reserve: reserve took 10.63 ms, sched copies = 4
Okayllama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 8192
llama_context: n_ctx_seq     = 8192
llama_context: n_batch       = 1792
llama_context: n_ubatch      = 256
llama_context: causal_attn   = 1
llama_context: flash_attn    = enabled
llama_context: kv_unified    = false
llama_context: freq_base     = 10000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (8192) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     0.97 MiB
llama_context: pipeline parallelism enabled
sched_reserve: reserving ...
sched_reserve: resolving fused Gated Delta Net support:
sched_reserve: fused Gated Delta Net (autoregressive) enabled
sched_reserve: fused Gated Delta Net (chunked) enabled
sched_reserve:      CUDA0 compute buffer size =   105.00 MiB
sched_reserve:  CUDA_Host compute buffer size =   100.00 MiB
sched_reserve: graph nodes  = 3
sched_reserve: graph splits = 1
sched_reserve: reserve took 53.57 ms, sched copies = 4
llama_init_from_model: DFlash auto-setup: using target model's embedding + lm_head layers
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 8192
llama_context: n_ctx_seq     = 8192
llama_context: n_batch       = 1792
llama_context: n_ubatch      = 256
llama_context: causal_attn   = 1
llama_context: flash_attn    = enabled
llama_context: kv_unified    = false
llama_context: freq_base     = 10000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (8192) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     0.97 MiB
llama_context: pipeline parallelism enabled
sched_reserve: reserving ...
sched_reserve: resolving fused Gated Delta Net support:
sched_reserve: fused Gated Delta Net (autoregressive) enabled
sched_reserve: fused Gated Delta Net (chunked) enabled
sched_reserve:      CUDA0 compute buffer size =   776.13 MiB
sched_reserve:    Vulkan0 compute buffer size =   262.50 MiB
sched_reserve:  CUDA_Host compute buffer size =   650.13 MiB
sched_reserve: graph nodes  = 180
sched_reserve: graph splits = 3
sched_reserve: reserve took 328.44 ms, sched copies = 4
, the user just sent a message with no content. Let me check if there's something missing or if they need assistance. Maybe they encountered an issue or want to ask a question but forgot to type it. I should respond politely to prompt them for more details. Let me make sure to keep it friendly and open-ended so they feel comfortable providing more information.
</think>

Hello! It seems like your message might have come through empty. How can I assist you today? Feel free to ask a question, share a topic, or let me know if you need help with anything specific!

encoded   10 tokens in    0.581 seconds, speed:   17.218 t/s
decoded  123 tokens in    7.577 seconds, speed:   16.234 t/s

n_draft   = 2
n_predict = 123
n_drafted = 61
n_accept  = 61
accept    = 100.000%

draft:


target:

common_perf_print:    sampling time =      31.88 ms
common_perf_print:    samplers time =      26.80 ms /    38 tokens
common_perf_print:        load time =    4482.16 ms
common_perf_print: prompt eval time =    3903.13 ms /   154 tokens (   25.35 ms per token,    39.46 tokens per second)
common_perf_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
common_perf_print:       total time =    9866.22 ms /   155 tokens
common_perf_print: unaccounted time =    5931.20 ms /  60.1 %      (total - sampling - prompt eval - eval) / (total)
common_perf_print:    graphs reused =         71
common_memory_breakdown_print: | memory breakdown [MiB]                    | total    free     self   model   context   compute    unaccounted |
common_memory_breakdown_print: |   - Vulkan0 (AI PRO R9700 (RADV GFX1201)) | 32768 = 12974 + (18483 = 17856 +     209 +     417) +        1310 |
common_memory_breakdown_print: |   - CUDA0 (RTX 4060 Ti)                   | 15949 = 11591 + (    0 =     0 +       0 +       0) +        4358 |
common_memory_breakdown_print: |   - Host                                  |                    851 =   833 +       0 +      18                |
llama_perf_context_print:        load time =    2228.10 ms
llama_perf_context_print: prompt eval time =    1077.04 ms /   122 tokens (    8.83 ms per token,   113.27 tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =    9793.20 ms /   123 tokens
llama_perf_context_print:    graphs reused =          0
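
The speculative-decoding counters at the end of the log above can be cross-checked by hand. A minimal sketch in Python, assuming the stats block keeps the `name = value` layout printed above; the helper names `parse_spec_stats` and `summarize` are illustrative and not part of llama.cpp:

```python
import re

# Excerpt copied from the log above.
LOG = """\
encoded   10 tokens in    0.581 seconds, speed:   17.218 t/s
decoded  123 tokens in    7.577 seconds, speed:   16.234 t/s

n_draft   = 2
n_predict = 123
n_drafted = 61
n_accept  = 61
accept    = 100.000%
"""


def parse_spec_stats(text):
    """Collect the integer n_* counters and the decode timing from the log text."""
    stats = {}
    for name, value in re.findall(r"^(n_\w+)\s*=\s*(\d+)\s*$", text, flags=re.MULTILINE):
        stats[name] = int(value)
    match = re.search(r"decoded\s+(\d+) tokens in\s+([\d.]+) seconds", text)
    if match:
        stats["decoded_tokens"] = int(match.group(1))
        stats["decode_seconds"] = float(match.group(2))
    return stats


def summarize(stats):
    """Return (acceptance rate, decode tokens per second)."""
    drafted = stats["n_drafted"]
    accept_rate = stats["n_accept"] / drafted if drafted else 0.0
    tokens_per_second = stats["decoded_tokens"] / stats["decode_seconds"]
    return accept_rate, tokens_per_second


stats = parse_spec_stats(LOG)
accept_rate, tps = summarize(stats)
print(f"acceptance rate: {accept_rate:.1%}")
print(f"decode speed:    {tps:.2f} t/s")
```

Note that this run used n_draft = 2, so the 100% acceptance rate covers only very short drafts and may not say much about behavior at the block size of 16 reported in the draft model's metadata.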

SunYong0821 · 2026-04-27T03:39:17Z

--dflash

llama.cpp-b8941 does not have this parameter.

rlex · 2026-04-27T04:02:28Z

llama.cpp-b8941 does not have this parameter.

Because it hasn't been merged into the master branch yet?

SunYong0821 · 2026-04-27T05:42:43Z

llama.cpp-b8941 does not have this parameter.

Because it hasn't been merged into the master branch yet?

When is it expected to be merged into master?

Raghuboi · 2026-04-27T12:18:57Z

I'm getting this error on startup:

/home/raghuboi/llama.cpp/src/llama-context.cpp:2509: GGML_ASSERT(tensor != nullptr && "DFlash extraction tensor is null") failed

I tried these models:
Qwen3 1.7B
Qwen3.5 2B

Meanwhile, https://huggingface.co/spiritbuun/Qwen3.6-27B-DFlash-GGUF fails to load on startup:

llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'dflash-draft'
llama_model_load_from_file_impl: failed to load model
srv load_model: failed to load draft model, '/home/raghuboi/Desktop/models/qwen-3.6/dflash-draft-3.6-q8_0.gguf'
srv operator(): operator(): cleaning up before exit...

set_dflash: DFlash extraction enabled for layers [0, 0, 0, 0, 0]
srv load_model: DFlash feature extraction enabled on target model
srv load_model: initializing slots, n_slots = 1
sched_reserve: reserving ...
sched_reserve: CUDA0 compute buffer size = 4981.19 MiB
sched_reserve: CUDA1 compute buffer size = 2373.19 MiB
sched_reserve: CUDA2 compute buffer size = 3119.20 MiB
sched_reserve: CUDA_Host compute buffer size = 4137.22 MiB
sched_reserve: graph nodes = 3849
sched_reserve: graph splits = 4
sched_reserve: reserve took 938.14 ms, sched copies = 4
/home/raghuboi/llama.cpp/src/llama-context.cpp:2509: GGML_ASSERT(tensor != nullptr && "DFlash extraction tensor is null") failed

exec "$LLAMA_SERVER" \
    -m "$MODEL" \
    -md "/home/raghuboi/Desktop/models/qwen-3.6/Qwen3-1.7B-Q8_0.gguf" \
    --dflash \
    --draft-max 16 \
    --alias "qwen3.6" \
    --host 0.0.0.0 --port 8081 \
    -ngl 99 \
    -np 1 \
    -kvu \
    --fit on \
    --fit-target 1024 \
    --fit-ctx 262144 \
    -c 262144 \
    -cd 4096 \
    --split-mode layer \
    -b 2048 -ub 1024 \
    --main-gpu 1 \
    --mlock \
    --no-mmap \
    --flash-attn on \
    --chat-template-kwargs '{"preserve_thinking":true}' \
    --cache-type-k q8_0 --cache-type-v q8_0 \
    --cache-ram 38912 \
    --ctx-checkpoints 128 \
    --cont-batching \
    --jinja \
    --threads 8 \
    --threads-batch 16 \
    --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 --repeat-penalty 1.0 --presence-penalty 0.0 \
    > "$LOG" 2>&1

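The log above reports extraction layers [0, 0, 0, 0, 0], whereas the working run earlier in this thread loads a draft whose metadata has general.architecture = dflash and dflash.target_layer_ids = [2, 17, 32, 47, 62]. It may be worth confirming what the file passed to -md actually contains. A minimal sketch in Python, assuming the GGUF v2/v3 layout (little-endian, 64-bit string and array lengths); the function names are illustrative and not part of llama.cpp:

```python
import struct
import sys

GGUF_STRING, GGUF_ARRAY = 8, 9
# struct formats for the fixed-size GGUF metadata value types
SCALAR_FMT = {
    0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
    6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d",
}


def _read(f, fmt):
    """Read one fixed-size little-endian value from the stream."""
    size = struct.calcsize(fmt)
    data = f.read(size)
    if len(data) != size:
        raise ValueError("truncated GGUF data")
    return struct.unpack(fmt, data)[0]


def _read_string(f):
    """Read a GGUF string: uint64 length followed by UTF-8 bytes."""
    length = _read(f, "<Q")
    data = f.read(length)
    if len(data) != length:
        raise ValueError("truncated GGUF data")
    return data.decode("utf-8", errors="replace")


def _read_value(f, vtype):
    """Read one metadata value of the given type id (arrays are read recursively)."""
    if vtype == GGUF_STRING:
        return _read_string(f)
    if vtype == GGUF_ARRAY:
        elem_type = _read(f, "<I")
        count = _read(f, "<Q")
        return [_read_value(f, elem_type) for _ in range(count)]
    if vtype not in SCALAR_FMT:
        raise ValueError(f"unknown GGUF value type {vtype}")
    return _read(f, SCALAR_FMT[vtype])


def read_gguf_metadata(f, wanted):
    """Return the subset of metadata key/value pairs whose keys are in `wanted`."""
    if f.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    version = _read(f, "<I")
    if version < 2:
        raise ValueError(f"unsupported GGUF version {version}")
    _read(f, "<Q")  # tensor count, not needed here
    n_kv = _read(f, "<Q")
    found = {}
    for _ in range(n_kv):
        key = _read_string(f)
        value = _read_value(f, _read(f, "<I"))
        if key in wanted:
            found[key] = value
    return found


# Keys taken from the metadata dump of the working DFlash draft in this thread.
KEYS = {"general.architecture", "dflash.block_size", "dflash.target_layer_ids"}

if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], "rb") as fh:
        meta = read_gguf_metadata(fh, KEYS)
    for key in sorted(KEYS):
        print(f"{key} = {meta.get(key, '<missing>')}")
```

Run it with the draft file path as the only argument. If general.architecture is not dflash, or the dflash.* keys are missing, the file is probably not a draft produced by this PR's converter.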
mbednarek360 · 2026-04-27T12:32:26Z

I'm getting the following error when trying to run this PR with the Vulkan backend on an R9700. Only one token is generated before it crashes:

GGML_ASSERT(cparams.n_ubatch >= n_tokens && "encoder requires n_ubatch >= n_tokens") failed

Full Log

Apr 27 08:11:21 michael-server llama-server[888649]: main: starting router server, no model will be loaded in this process
Apr 27 08:11:21 michael-server llama-server[888649]: start: binding port with default address family
Apr 27 08:11:21 michael-server llama-server[888649]: main: router server is listening on http://127.0.0.1:8090
Apr 27 08:11:21 michael-server llama-server[888649]: main: NOTE: router mode is experimental
Apr 27 08:11:21 michael-server llama-server[888649]: main:       it is not recommended to use this mode in untrusted environments
Apr 27 08:11:44 michael-server llama-server[888649]: srv  ensure_model: model name=Qwen3.6 is not loaded, loading...
Apr 27 08:11:44 michael-server llama-server[888649]: srv          load: spawning server instance with name=Qwen3.6 on port 39385
Apr 27 08:11:44 michael-server llama-server[888649]: srv          load: spawning server instance with args:
Apr 27 08:11:44 michael-server llama-server[888649]: srv          load:   /nix/store/vh2367dlgknxjq9bgivlx314fxz91aw5-llama-cpp-vulkan-0.0.0/bin/llama-server
Apr 27 08:11:44 michael-server llama-server[888649]: srv          load:   --cache-reuse
Apr 27 08:11:44 michael-server llama-server[888649]: srv          load:   256
Apr 27 08:11:44 michael-server llama-server[888649]: srv          load:   --chat-template-file
Apr 27 08:11:44 michael-server llama-server[888649]: srv          load:   /nix/store/wg2gx0sf317iisfy17ap6vmkyrk8qd4y-qwen-36.jinja
Apr 27 08:11:44 michael-server llama-server[888649]: srv          load:   --chat-template-kwargs
Apr 27 08:11:44 michael-server llama-server[888649]: srv          load:   {"preserve_thinking": true}
Apr 27 08:11:44 michael-server llama-server[888649]: srv          load:   --no-context-shift
Apr 27 08:11:44 michael-server llama-server[888649]: srv          load:   --dflash
Apr 27 08:11:44 michael-server llama-server[888649]: srv          load:   --host
Apr 27 08:11:44 michael-server llama-server[888649]: srv          load:   127.0.0.1
Apr 27 08:11:44 michael-server llama-server[888649]: srv          load:   --jinja
Apr 27 08:11:44 michael-server llama-server[888649]: srv          load:   --keep
Apr 27 08:11:44 michael-server llama-server[888649]: srv          load:   -1
Apr 27 08:11:44 michael-server llama-server[888649]: srv          load:   --min-p
Apr 27 08:11:44 michael-server llama-server[888649]: srv          load:   0.0
Apr 27 08:11:44 michael-server llama-server[888649]: srv          load:   --no-mmproj-auto
Apr 27 08:11:44 michael-server llama-server[888649]: srv          load:   --port
Apr 27 08:11:44 michael-server llama-server[888649]: srv          load:   39385
Apr 27 08:11:44 michael-server llama-server[888649]: srv          load:   --presence-penalty
Apr 27 08:11:44 michael-server llama-server[888649]: srv          load:   0.0
Apr 27 08:11:44 michael-server llama-server[888649]: srv          load:   --prio
Apr 27 08:11:44 michael-server llama-server[888649]: srv          load:   3
Apr 27 08:11:44 michael-server llama-server[888649]: srv          load:   --repeat-penalty
Apr 27 08:11:44 michael-server llama-server[888649]: srv          load:   1.0
Apr 27 08:11:44 michael-server llama-server[888649]: srv          load:   --temperature
Apr 27 08:11:44 michael-server llama-server[888649]: srv          load:   0.6
Apr 27 08:11:44 michael-server llama-server[888649]: srv          load:   --top-k
Apr 27 08:11:44 michael-server llama-server[888649]: srv          load:   20
Apr 27 08:11:44 michael-server llama-server[888649]: srv          load:   --top-p
Apr 27 08:11:44 michael-server llama-server[888649]: srv          load:   0.95
Apr 27 08:11:44 michael-server llama-server[888649]: srv          load:   --alias
Apr 27 08:11:44 michael-server llama-server[888649]: srv          load:   Qwen3.6
Apr 27 08:11:44 michael-server llama-server[888649]: srv          load:   --ctx-size-draft
Apr 27 08:11:44 michael-server llama-server[888649]: srv          load:   2048
Apr 27 08:11:44 michael-server llama-server[888649]: srv          load:   --cache-type-k
Apr 27 08:11:44 michael-server llama-server[888649]: srv          load:   q8_0
Apr 27 08:11:44 michael-server llama-server[888649]: srv          load:   --cache-type-v
Apr 27 08:11:44 michael-server llama-server[888649]: srv          load:   q8_0
Apr 27 08:11:44 michael-server llama-server[888649]: srv          load:   --flash-attn
Apr 27 08:11:44 michael-server llama-server[888649]: srv          load:   on
Apr 27 08:11:44 michael-server llama-server[888649]: srv          load:   --fit
Apr 27 08:11:44 michael-server llama-server[888649]: srv          load:   off
Apr 27 08:11:44 michael-server llama-server[888649]: srv          load:   --hf-repo
Apr 27 08:11:44 michael-server llama-server[888649]: srv          load:   bartowski/Qwen_Qwen3.6-27B-GGUF:Q6_K_L
Apr 27 08:11:44 michael-server llama-server[888649]: srv          load:   --model-draft
Apr 27 08:11:44 michael-server llama-server[888649]: srv          load:   /media/Downloads/Qwen3.6-dflash.gguf
Apr 27 08:11:44 michael-server llama-server[888649]: srv          load:   --n-gpu-layers
Apr 27 08:11:44 michael-server llama-server[888649]: srv          load:   99
Apr 27 08:11:44 michael-server llama-server[888649]: srv          load:   --n-gpu-layers-draft
Apr 27 08:11:44 michael-server llama-server[888649]: srv          load:   99
Apr 27 08:11:44 michael-server llama-server[888649]: srv          load:   --parallel
Apr 27 08:11:44 michael-server llama-server[888649]: srv          load:   1
Apr 27 08:11:44 michael-server llama-server[888649]: srv          load:   --reasoning
Apr 27 08:11:44 michael-server llama-server[888649]: srv          load:   on
Apr 27 08:11:44 michael-server llama-server[888649]: srv  ensure_model: waiting until model name=Qwen3.6 is fully loaded...
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] WARNING: radv is not a conformant Vulkan implementation, testing use only.
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] common_download_file_single_online: HEAD failed, status: 404
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] no remote preset found, skipping
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] build_info: b0-unknown
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] Running without SSL
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] init: using 31 threads for HTTP server
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] start: binding port with default address family
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] main: loading model
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] srv    load_model: loading model '/var/cache/llama-cpp/models--bartowski--Qwen_Qwen3.6-27B-GGUF/snapshots/f73b625d7ceedbd05d14a93874387cd3bcd673b7/Qwen_Qwen3.6-27B-Q6_K_L.gguf'
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon AI PRO R9700 (RADV GFX1201)) (0000:0a:00.0) - 32535 MiB free
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] llama_model_loader: loaded meta data with 45 key-value pairs and 851 tensors from /var/cache/llama-cpp/models--bartowski--Qwen_Qwen3.6-27B-GGUF/snapshots/f73b625d7ceedbd05d14a93874387cd3bcd673b7/Qwen_Qwen3.6-27B-Q6_K_L.gguf (version GGUF V3 (latest))
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] llama_model_loader: - kv   0:                       general.architecture str              = qwen35
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] llama_model_loader: - kv   1:                               general.type str              = model
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] llama_model_loader: - kv   2:                     general.sampling.top_k i32              = 20
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] llama_model_loader: - kv   3:                     general.sampling.top_p f32              = 0.950000
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] llama_model_loader: - kv   4:                      general.sampling.temp f32              = 1.000000
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] llama_model_loader: - kv   5:                               general.name str              = Qwen3.6 27B
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] llama_model_loader: - kv   6:                           general.basename str              = Qwen3.6
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] llama_model_loader: - kv   7:                         general.size_label str              = 27B
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] llama_model_loader: - kv   8:                            general.license str              = apache-2.0
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] llama_model_loader: - kv   9:                       general.license.link str              = https://huggingface.co/Qwen/Qwen3.6-2...
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] llama_model_loader: - kv  10:                               general.tags arr[str,1]       = ["image-text-to-text"]
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] llama_model_loader: - kv  11:                         qwen35.block_count u32              = 64
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] llama_model_loader: - kv  12:                      qwen35.context_length u32              = 262144
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] llama_model_loader: - kv  13:                    qwen35.embedding_length u32              = 5120
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] llama_model_loader: - kv  14:                 qwen35.feed_forward_length u32              = 17408
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] llama_model_loader: - kv  15:                qwen35.attention.head_count u32              = 24
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] llama_model_loader: - kv  16:             qwen35.attention.head_count_kv u32              = 4
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] llama_model_loader: - kv  17:             qwen35.rope.dimension_sections arr[i32,4]       = [11, 11, 10, 0]
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] llama_model_loader: - kv  18:                      qwen35.rope.freq_base f32              = 10000000.000000
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] llama_model_loader: - kv  19:    qwen35.attention.layer_norm_rms_epsilon f32              = 0.000001
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] llama_model_loader: - kv  20:                qwen35.attention.key_length u32              = 256
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] llama_model_loader: - kv  21:              qwen35.attention.value_length u32              = 256
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] llama_model_loader: - kv  22:                     qwen35.ssm.conv_kernel u32              = 4
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] llama_model_loader: - kv  23:                      qwen35.ssm.state_size u32              = 128
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] llama_model_loader: - kv  24:                     qwen35.ssm.group_count u32              = 16
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] llama_model_loader: - kv  25:                  qwen35.ssm.time_step_rank u32              = 48
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] llama_model_loader: - kv  26:                      qwen35.ssm.inner_size u32              = 6144
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] llama_model_loader: - kv  27:             qwen35.full_attention_interval u32              = 4
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] llama_model_loader: - kv  28:                qwen35.rope.dimension_count u32              = 64
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] llama_model_loader: - kv  29:                       tokenizer.ggml.model str              = gpt2
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] llama_model_loader: - kv  30:                         tokenizer.ggml.pre str              = qwen35
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] llama_model_loader: - kv  31:                      tokenizer.ggml.tokens arr[str,248320]  = ["!", "\"", "#", "$", "%", "&", "'", ...
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] llama_model_loader: - kv  32:                  tokenizer.ggml.token_type arr[i32,248320]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] llama_model_loader: - kv  33:                      tokenizer.ggml.merges arr[str,247587]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] llama_model_loader: - kv  34:                tokenizer.ggml.eos_token_id u32              = 248046
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] llama_model_loader: - kv  35:            tokenizer.ggml.padding_token_id u32              = 248044
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] llama_model_loader: - kv  36:                tokenizer.ggml.bos_token_id u32              = 248044
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] llama_model_loader: - kv  37:               tokenizer.ggml.add_bos_token bool             = false
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] llama_model_loader: - kv  38:                    tokenizer.chat_template str              = {%- set image_count = namespace(value...
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] llama_model_loader: - kv  39:               general.quantization_version u32              = 2
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] llama_model_loader: - kv  40:                          general.file_type u32              = 18
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] llama_model_loader: - kv  41:                      quantize.imatrix.file str              = /models_out/Qwen3.6-27B-GGUF/Qwen_Qwe...
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] llama_model_loader: - kv  42:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav5.txt
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] llama_model_loader: - kv  43:             quantize.imatrix.entries_count u32              = 496
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] llama_model_loader: - kv  44:              quantize.imatrix.chunks_count u32              = 802
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] llama_model_loader: - type  f32:  449 tensors
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] llama_model_loader: - type q8_0:  122 tensors
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] llama_model_loader: - type q6_K:  280 tensors
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: file format = GGUF V3 (latest)
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: file type   = Q6_K
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: file size   = 22.19 GiB (7.09 BPW)
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] load: 0 unused tokens
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] load: printing all EOG tokens:
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] load:   - 248044 ('<|endoftext|>')
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] load:   - 248046 ('<|im_end|>')
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] load:   - 248063 ('<|fim_pad|>')
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] load:   - 248064 ('<|repo_name|>')
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] load:   - 248065 ('<|file_sep|>')
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] load: special tokens cache size = 33
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] load: token to piece cache size = 1.7581 MB
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: arch                  = qwen35
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: vocab_only            = 0
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: no_alloc              = 0
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: n_ctx_train           = 262144
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: n_embd                = 5120
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: n_embd_inp            = 5120
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: n_layer               = 64
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: n_head                = 24
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: n_head_kv             = 4
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: n_rot                 = 64
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: n_swa                 = 0
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: is_swa_any            = 0
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: n_embd_head_k         = 256
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: n_embd_head_v         = 256
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: n_gqa                 = 6
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: n_embd_k_gqa          = 1024
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: n_embd_v_gqa          = 1024
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: f_norm_eps            = 0.0e+00
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: f_norm_rms_eps        = 1.0e-06
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: f_clamp_kqv           = 0.0e+00
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: f_max_alibi_bias      = 0.0e+00
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: f_logit_scale         = 0.0e+00
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: f_attn_scale          = 0.0e+00
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: n_ff                  = 17408
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: n_expert              = 0
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: n_expert_used         = 0
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: n_expert_groups       = 0
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: n_group_used          = 0
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: causal attn           = 1
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: pooling type          = -1
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: rope type             = 40
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: rope scaling          = linear
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: freq_base_train       = 10000000.0
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: freq_scale_train      = 1
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: n_ctx_orig_yarn       = 262144
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: rope_yarn_log_mul     = 0.0000
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: rope_finetuned        = unknown
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: mrope sections        = [11, 11, 10, 0]
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: ssm_d_conv            = 4
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: ssm_d_inner           = 6144
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: ssm_d_state           = 128
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: ssm_dt_rank           = 48
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: ssm_n_group           = 16
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: ssm_dt_b_c_rms        = 0
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: model type            = 27B
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: model params          = 26.90 B
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: general.name          = Qwen3.6 27B
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: vocab type            = BPE
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: n_vocab               = 248320
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: n_merges              = 247587
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: BOS token             = 248044 '<|endoftext|>'
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: EOS token             = 248046 '<|im_end|>'
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: EOT token             = 248046 '<|im_end|>'
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: PAD token             = 248044 '<|endoftext|>'
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: LF token              = 198 'Ċ'
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: FIM PRE token         = 248060 '<|fim_prefix|>'
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: FIM SUF token         = 248062 '<|fim_suffix|>'
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: FIM MID token         = 248061 '<|fim_middle|>'
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: FIM PAD token         = 248063 '<|fim_pad|>'
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: FIM REP token         = 248064 '<|repo_name|>'
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: FIM SEP token         = 248065 '<|file_sep|>'
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: EOG token             = 248044 '<|endoftext|>'
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: EOG token             = 248046 '<|im_end|>'
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: EOG token             = 248063 '<|fim_pad|>'
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: EOG token             = 248064 '<|repo_name|>'
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: EOG token             = 248065 '<|file_sep|>'
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] print_info: max token length      = 256
Apr 27 08:11:44 michael-server llama-server[888649]: [39385] load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)
Apr 27 08:11:59 michael-server llama-server[888649]: [39385] load_tensors: offloading output layer to GPU
Apr 27 08:11:59 michael-server llama-server[888649]: [39385] load_tensors: offloading 63 repeating layers to GPU
Apr 27 08:11:59 michael-server llama-server[888649]: [39385] load_tensors: offloaded 65/65 layers to GPU
Apr 27 08:11:59 michael-server llama-server[888649]: [39385] load_tensors:   CPU_Mapped model buffer size =  1288.28 MiB
Apr 27 08:11:59 michael-server llama-server[888649]: [39385] load_tensors:      Vulkan0 model buffer size = 21436.81 MiB
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] ...........................................................................................
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] common_init_result: added <|endoftext|> logit bias = -inf
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] common_init_result: added <|im_end|> logit bias = -inf
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] common_init_result: added <|fim_pad|> logit bias = -inf
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] common_init_result: added <|repo_name|> logit bias = -inf
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] common_init_result: added <|file_sep|> logit bias = -inf
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] llama_context: constructing llama_context
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] llama_context: n_seq_max     = 1
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] llama_context: n_ctx         = 262144
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] llama_context: n_ctx_seq     = 262144
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] llama_context: n_batch       = 2048
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] llama_context: n_ubatch      = 512
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] llama_context: causal_attn   = 1
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] llama_context: flash_attn    = enabled
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] llama_context: kv_unified    = false
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] llama_context: freq_base     = 10000000.0
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] llama_context: freq_scale    = 1
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] llama_context: Vulkan_Host  output buffer size =     0.95 MiB
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] llama_kv_cache:    Vulkan0 KV buffer size =  8704.00 MiB
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] llama_kv_cache: size = 8704.00 MiB (262144 cells,  16 layers,  1/1 seqs), K (q8_0): 4352.00 MiB, V (q8_0): 4352.00 MiB
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] llama_kv_cache: attn_rot_k = 1, n_embd_head_k_all = 256
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] llama_kv_cache: attn_rot_v = 1, n_embd_head_k_all = 256
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] llama_memory_recurrent:    Vulkan0 RS buffer size =   149.62 MiB
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] llama_memory_recurrent: size =  149.62 MiB (     1 cells,  64 layers,  1 seqs), R (f32):    5.62 MiB, S (f32):  144.00 MiB
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] sched_reserve: reserving ...
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] sched_reserve: resolving fused Gated Delta Net support:
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] sched_reserve: fused Gated Delta Net (autoregressive) enabled
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] sched_reserve: fused Gated Delta Net (chunked) enabled
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] sched_reserve:    Vulkan0 compute buffer size =   840.28 MiB
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] sched_reserve: Vulkan_Host compute buffer size =   532.29 MiB
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] sched_reserve: graph nodes  = 3849
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] sched_reserve: graph splits = 2
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] sched_reserve: reserve took 60.19 ms, sched copies = 1
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] srv    load_model: loading draft model '/media/Downloads/Qwen3.6-dflash.gguf'
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon AI PRO R9700 (RADV GFX1201)) (0000:0a:00.0) - 1388 MiB free
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] llama_model_loader: loaded meta data with 31 key-value pairs and 58 tensors from /media/Downloads/Qwen3.6-dflash.gguf (version GGUF V3 (latest))
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] llama_model_loader: - kv   0:                       general.architecture str              = dflash
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] llama_model_loader: - kv   1:                               general.type str              = model
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] llama_model_loader: - kv   2:                               general.name str              = Qwen3.6 27B DFlash
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] llama_model_loader: - kv   3:                           general.finetune str              = 27b-DFlash
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] llama_model_loader: - kv   4:                           general.basename str              = Qwen3.6
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] llama_model_loader: - kv   5:                         general.size_label str              = 1.7B
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] llama_model_loader: - kv   6:                         dflash.block_count u32              = 5
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] llama_model_loader: - kv   7:                      dflash.context_length u32              = 262144
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] llama_model_loader: - kv   8:                    dflash.embedding_length u32              = 5120
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] llama_model_loader: - kv   9:                 dflash.feed_forward_length u32              = 17408
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] llama_model_loader: - kv  10:                dflash.attention.head_count u32              = 32
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] llama_model_loader: - kv  11:             dflash.attention.head_count_kv u32              = 8
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] llama_model_loader: - kv  12:                      dflash.rope.freq_base f32              = 10000000.000000
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] llama_model_loader: - kv  13:    dflash.attention.layer_norm_rms_epsilon f32              = 0.000001
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] llama_model_loader: - kv  14:                dflash.attention.key_length u32              = 128
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] llama_model_loader: - kv  15:              dflash.attention.value_length u32              = 128
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] llama_model_loader: - kv  16:                          general.file_type u32              = 7
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] llama_model_loader: - kv  17:                          dflash.block_size u32              = 16
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] llama_model_loader: - kv  18:                    dflash.target_layer_ids arr[i32,5]       = [2, 17, 32, 47, 62]
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] llama_model_loader: - kv  19:                       dflash.mask_token_id u32              = 248070
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] llama_model_loader: - kv  20:               general.quantization_version u32              = 2
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = qwen35
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,248320]  = ["!", "\"", "#", "$", "%", "&", "'", ...
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,248320]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,247587]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 248046
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] llama_model_loader: - kv  27:            tokenizer.ggml.padding_token_id u32              = 248044
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] llama_model_loader: - kv  28:                tokenizer.ggml.bos_token_id u32              = 248044
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] llama_model_loader: - kv  29:               tokenizer.ggml.add_bos_token bool             = false
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] llama_model_loader: - kv  30:                    tokenizer.chat_template str              = {%- set image_count = namespace(value...
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] llama_model_loader: - type  f32:   22 tensors
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] llama_model_loader: - type q8_0:   36 tensors
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: file format = GGUF V3 (latest)
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: file type   = Q8_0
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: file size   = 1.71 GiB (8.50 BPW)
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] load_hparams: DFlash extract_layers = [2, 17, 32, 47, 62]
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] load_hparams: DFlash block_size = 16, mask_token_id = 248070
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] load: 0 unused tokens
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] load: printing all EOG tokens:
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] load:   - 248044 ('<|endoftext|>')
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] load:   - 248046 ('<|im_end|>')
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] load:   - 248063 ('<|fim_pad|>')
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] load:   - 248064 ('<|repo_name|>')
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] load:   - 248065 ('<|file_sep|>')
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] load: special tokens cache size = 33
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] load: token to piece cache size = 1.7581 MB
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: arch                  = dflash
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: vocab_only            = 0
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: no_alloc              = 0
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: n_ctx_train           = 262144
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: n_embd                = 5120
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: n_embd_inp            = 5120
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: n_layer               = 5
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: n_head                = 32
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: n_head_kv             = 8
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: n_rot                 = 128
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: n_swa                 = 0
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: is_swa_any            = 0
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: n_embd_head_k         = 128
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: n_embd_head_v         = 128
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: n_gqa                 = 4
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: n_embd_k_gqa          = 1024
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: n_embd_v_gqa          = 1024
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: f_norm_eps            = 0.0e+00
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: f_norm_rms_eps        = 1.0e-06
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: f_clamp_kqv           = 0.0e+00
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: f_max_alibi_bias      = 0.0e+00
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: f_logit_scale         = 0.0e+00
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: f_attn_scale          = 0.0e+00
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: n_ff                  = 17408
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: n_expert              = 0
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: n_expert_used         = 0
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: n_expert_groups       = 0
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: n_group_used          = 0
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: causal attn           = 1
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: pooling type          = -1
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: rope type             = 2
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: rope scaling          = linear
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: freq_base_train       = 10000000.0
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: freq_scale_train      = 1
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: n_ctx_orig_yarn       = 262144
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: rope_yarn_log_mul     = 0.0000
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: rope_finetuned        = unknown
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: model type            = ?B
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: model params          = 1.73 B
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: general.name          = Qwen3.6 27B DFlash
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: vocab type            = BPE
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: n_vocab               = 248320
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: n_merges              = 247587
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: BOS token             = 248044 '<|endoftext|>'
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: EOS token             = 248046 '<|im_end|>'
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: EOT token             = 248046 '<|im_end|>'
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: PAD token             = 248044 '<|endoftext|>'
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: LF token              = 198 'Ċ'
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: FIM PRE token         = 248060 '<|fim_prefix|>'
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: FIM SUF token         = 248062 '<|fim_suffix|>'
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: FIM MID token         = 248061 '<|fim_middle|>'
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: FIM PAD token         = 248063 '<|fim_pad|>'
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: FIM REP token         = 248064 '<|repo_name|>'
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: FIM SEP token         = 248065 '<|file_sep|>'
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: EOG token             = 248044 '<|endoftext|>'
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: EOG token             = 248046 '<|im_end|>'
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: EOG token             = 248063 '<|fim_pad|>'
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: EOG token             = 248064 '<|repo_name|>'
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: EOG token             = 248065 '<|file_sep|>'
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] print_info: max token length      = 256
Apr 27 08:12:04 michael-server llama-server[888649]: [39385] load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)
Apr 27 08:12:08 michael-server llama-server[888649]: [39385] load_tensors: offloading output layer to GPU
Apr 27 08:12:08 michael-server llama-server[888649]: [39385] load_tensors: offloading 4 repeating layers to GPU
Apr 27 08:12:08 michael-server llama-server[888649]: [39385] load_tensors: offloaded 6/6 layers to GPU
Apr 27 08:12:08 michael-server llama-server[888649]: [39385] load_tensors:      Vulkan0 model buffer size =  1753.36 MiB
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] .............................
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] set_dflash: DFlash extraction enabled for layers [2, 17, 32, 47, 62]
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] srv    load_model: DFlash feature extraction enabled on target model
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] srv    load_model: cache_reuse is not supported by this context, it will be disabled
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] srv    load_model: initializing slots, n_slots = 1
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] sched_reserve: reserving ...
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] sched_reserve:    Vulkan0 compute buffer size =  1335.28 MiB
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] sched_reserve: Vulkan_Host compute buffer size =   532.29 MiB
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] sched_reserve: graph nodes  = 3849
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] sched_reserve: graph splits = 2
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] sched_reserve: reserve took 121.81 ms, sched copies = 1
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] common_context_can_seq_rm: the target context does not support partial sequence removal
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] srv    load_model: speculative decoding will use checkpoints
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] llama_context: constructing llama_context
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] llama_context: n_seq_max     = 1
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] llama_context: n_ctx         = 2048
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] llama_context: n_ctx_seq     = 2048
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] llama_context: n_batch       = 2048
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] llama_context: n_ubatch      = 512
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] llama_context: causal_attn   = 1
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] llama_context: flash_attn    = enabled
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] llama_context: kv_unified    = false
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] llama_context: freq_base     = 10000000.0
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] llama_context: freq_scale    = 1
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] llama_context: Vulkan_Host  output buffer size =     0.97 MiB
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] sched_reserve: reserving ...
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] sched_reserve: resolving fused Gated Delta Net support:
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] sched_reserve: fused Gated Delta Net (autoregressive) enabled
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] sched_reserve: fused Gated Delta Net (chunked) enabled
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] sched_reserve:    Vulkan0 compute buffer size =    60.00 MiB
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] sched_reserve: Vulkan_Host compute buffer size =    50.00 MiB
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] sched_reserve: graph nodes  = 3
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] sched_reserve: graph splits = 1
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] sched_reserve: reserve took 17.69 ms, sched copies = 1
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] llama_init_from_model: DFlash auto-setup: using target model's embedding + lm_head layers
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] llama_context: constructing llama_context
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] llama_context: n_seq_max     = 1
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] llama_context: n_ctx         = 2048
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] llama_context: n_ctx_seq     = 2048
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] llama_context: n_batch       = 2048
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] llama_context: n_ubatch      = 512
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] llama_context: causal_attn   = 1
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] llama_context: flash_attn    = enabled
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] llama_context: kv_unified    = false
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] llama_context: freq_base     = 10000000.0
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] llama_context: freq_scale    = 1
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] llama_context: Vulkan_Host  output buffer size =     0.97 MiB
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] sched_reserve: reserving ...
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] sched_reserve: resolving fused Gated Delta Net support:
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] sched_reserve: fused Gated Delta Net (autoregressive) enabled
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] sched_reserve: fused Gated Delta Net (chunked) enabled
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] sched_reserve:    Vulkan0 compute buffer size =   495.00 MiB
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] sched_reserve: Vulkan_Host compute buffer size =    60.02 MiB
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] sched_reserve: graph nodes  = 180
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] sched_reserve: graph splits = 2
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] sched_reserve: reserve took 6.90 ms, sched copies = 1
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] slot   load_model: id  0 | task -1 | speculative decoding context initialized
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] slot   load_model: id  0 | task -1 | new slot, n_ctx = 262144
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] srv    load_model: prompt cache is enabled, size limit: 8192 MiB
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] srv    load_model: use `--cache-ram 0` to disable the prompt cache
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] srv    load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] srv          init: init: --cache-idle-slots requires --kv-unified, disabling
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] init: chat template, example_format: '<|im_start|>system
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] You are a helpful assistant<|im_end|>
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] <|im_start|>user
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] Hello<|im_end|>
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] <|im_start|>assistant
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] <think>
Apr 27 08:12:09 michael-server llama-server[888649]: [39385]
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] </think>
Apr 27 08:12:09 michael-server llama-server[888649]: [39385]
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] Hi there<|im_end|>
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] <|im_start|>user
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] How are you?<|im_end|>
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] <|im_start|>assistant
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] <think>
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] '
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] srv          init: init: chat template, thinking = 1
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] main: model loaded
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] main: server is listening on http://127.0.0.1:39385
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] main: starting the main loop...
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] cmd_child_to_router:ready
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] srv    operator(): child server monitoring thread started, waiting for EOF on stdin...
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] srv  update_slots: all slots are idle
Apr 27 08:12:09 michael-server llama-server[888649]: srv  proxy_reques: proxying request to model Qwen3.6 on port 39385
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] srv  params_from_: Chat format: peg-native
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = -1
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] srv  get_availabl: updating prompt cache
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] srv          load:  - looking for better prompt, base f_keep = -1.000, sim = 0.000
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] srv        update:  - cache state: 0 prompts, 0.000 MiB (limits: 8192.000 MiB, 262144 tokens, 8589934592 est)
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] srv  get_availabl: prompt cache update took 0.00 ms
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] slot launch_slot_: id  0 | task 0 | processing task, is_child = 0
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 262144, n_keep = -1, task.n_tokens = 975
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] slot update_slots: id  0 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
Apr 27 08:12:09 michael-server llama-server[888649]: [39385] slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 459, batch.n_tokens = 459, progress = 0.470769
Apr 27 08:12:11 michael-server llama-server[888649]: [39385] slot update_slots: id  0 | task 0 | n_tokens = 459, memory_seq_rm [459, end)
Apr 27 08:12:11 michael-server llama-server[888649]: [39385] slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 971, batch.n_tokens = 512, progress = 0.995897
Apr 27 08:12:11 michael-server llama-server[888649]: [39385] slot create_check: id  0 | task 0 | created context checkpoint 1 of 32 (pos_min = 458, pos_max = 458, n_tokens = 459, size = 149.626 MiB)
Apr 27 08:12:14 michael-server llama-server[888649]: [39385] slot update_slots: id  0 | task 0 | n_tokens = 971, memory_seq_rm [971, end)
Apr 27 08:12:14 michael-server llama-server[888649]: [39385] reasoning-budget: activated, budget=2147483647 tokens
Apr 27 08:12:14 michael-server llama-server[888649]: [39385] slot init_sampler: id  0 | task 0 | init sampler, took 0.12 ms, tokens: text = 975, total = 975
Apr 27 08:12:14 michael-server llama-server[888649]: [39385] slot update_slots: id  0 | task 0 | prompt processing done, n_tokens = 975, batch.n_tokens = 4
Apr 27 08:12:14 michael-server llama-server[888649]: [39385] slot create_check: id  0 | task 0 | created context checkpoint 2 of 32 (pos_min = 970, pos_max = 970, n_tokens = 971, size = 149.626 MiB)
Apr 27 08:12:14 michael-server llama-server[888649]: [39385] srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
Apr 27 08:12:14 michael-server llama-server[888649]: srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
Apr 27 08:12:14 michael-server llama-server[888649]: [39385] /build/source/src/llama-context.cpp:1393: GGML_ASSERT(cparams.n_ubatch >= n_tokens && "encoder requires n_ubatch >= n_tokens") failed
Apr 27 08:12:14 michael-server llama-server[888649]: [39385] /nix/store/vh2367dlgknxjq9bgivlx314fxz91aw5-llama-cpp-vulkan-0.0.0/lib/libggml-base.so.0(+0x1979a) [0x7ffbc169879a]
Apr 27 08:12:14 michael-server llama-server[888649]: [39385] /nix/store/vh2367dlgknxjq9bgivlx314fxz91aw5-llama-cpp-vulkan-0.0.0/lib/libggml-base.so.0(ggml_print_backtrace+0x204) [0x7ffbc1698c64]
Apr 27 08:12:14 michael-server llama-server[888649]: [39385] /nix/store/vh2367dlgknxjq9bgivlx314fxz91aw5-llama-cpp-vulkan-0.0.0/lib/libggml-base.so.0(ggml_abort+0x159) [0x7ffbc1698e39]
Apr 27 08:12:14 michael-server llama-server[888649]: [39385] /nix/store/vh2367dlgknxjq9bgivlx314fxz91aw5-llama-cpp-vulkan-0.0.0/lib/libllama.so.0(_ZN13llama_context6encodeERK11llama_batch+0x1089) [0x7ffbc54f9459]
Apr 27 08:12:14 michael-server llama-server[888649]: [39385] /nix/store/vh2367dlgknxjq9bgivlx314fxz91aw5-llama-cpp-vulkan-0.0.0/lib/libllama.so.0(llama_encode+0x11) [0x7ffbc54f9481]
Apr 27 08:12:14 michael-server llama-server[888649]: [39385] /nix/store/vh2367dlgknxjq9bgivlx314fxz91aw5-llama-cpp-vulkan-0.0.0/lib/libllama-common.so.0(_ZN31common_speculative_state_dflash5draftERK25common_params_speculativeRKSt6vectorIiSaIiEEiRS5_+0xff) [0x7ffbc5af92bf]
Apr 27 08:12:14 michael-server llama-server[888649]: [39385] /nix/store/vh2367dlgknxjq9bgivlx314fxz91aw5-llama-cpp-vulkan-0.0.0/lib/libllama-common.so.0(_Z24common_speculative_draftP18common_speculativeRK25common_params_speculativeRKSt6vectorIiSaIiEEi+0xa4) [0x7ffbc5af1894]
Apr 27 08:12:14 michael-server llama-server[888649]: [39385] /nix/store/vh2367dlgknxjq9bgivlx314fxz91aw5-llama-cpp-vulkan-0.0.0/bin/llama-server(+0x1010ce) [0x564c8aff70ce]
Apr 27 08:12:14 michael-server llama-server[888649]: [39385] /nix/store/vh2367dlgknxjq9bgivlx314fxz91aw5-llama-cpp-vulkan-0.0.0/bin/llama-server(+0x10be68) [0x564c8b001e68]
Apr 27 08:12:14 michael-server llama-server[888649]: [39385] /nix/store/vh2367dlgknxjq9bgivlx314fxz91aw5-llama-cpp-vulkan-0.0.0/bin/llama-server(+0x1a8312) [0x564c8b09e312]
Apr 27 08:12:14 michael-server llama-server[888649]: [39385] /nix/store/vh2367dlgknxjq9bgivlx314fxz91aw5-llama-cpp-vulkan-0.0.0/bin/llama-server(+0x6e5e2) [0x564c8af645e2]
Apr 27 08:12:14 michael-server llama-server[888649]: [39385] /nix/store/jms7zxzm7w1whczwny5m3gkgdjghmi2r-glibc-2.42-51/lib/libc.so.6(+0x2b285) [0x7ffbc095d285]
Apr 27 08:12:14 michael-server llama-server[888649]: [39385] /nix/store/jms7zxzm7w1whczwny5m3gkgdjghmi2r-glibc-2.42-51/lib/libc.so.6(__libc_start_main+0x88) [0x7ffbc095d338]
Apr 27 08:12:14 michael-server llama-server[888649]: [39385] /nix/store/vh2367dlgknxjq9bgivlx314fxz91aw5-llama-cpp-vulkan-0.0.0/bin/llama-server(+0x6ecb5) [0x564c8af64cb5]

mbednarek360 · 2026-04-27T12:34:22Z

getting this startup error: /home/raghuboi/llama.cpp/src/llama-context.cpp:2509: GGML_ASSERT(tensor != nullptr && "DFlash extraction tensor is null") failed

Tried these models: Qwen 3 1.7B, Qwen 3.5 2B

Meanwhile https://huggingface.co/spiritbuun/Qwen3.6-27B-DFlash-GGUF fails to load on startup:

llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'dflash-draft'
llama_model_load_from_file_impl: failed to load model
srv load_model: failed to load draft model, '/home/raghuboi/Desktop/models/qwen-3.6/dflash-draft-3.6-q8_0.gguf'
srv operator(): operator(): cleaning up before exit...

set_dflash: DFlash extraction enabled for layers [0, 0, 0, 0, 0]
srv load_model: DFlash feature extraction enabled on target model
srv load_model: initializing slots, n_slots = 1
sched_reserve: reserving ...
sched_reserve: CUDA0 compute buffer size = 4981.19 MiB
sched_reserve: CUDA1 compute buffer size = 2373.19 MiB
sched_reserve: CUDA2 compute buffer size = 3119.20 MiB
sched_reserve: CUDA_Host compute buffer size = 4137.22 MiB
sched_reserve: graph nodes = 3849
sched_reserve: graph splits = 4
sched_reserve: reserve took 938.14 ms, sched copies = 4
/home/raghuboi/llama.cpp/src/llama-context.cpp:2509: GGML_ASSERT(tensor != nullptr && "DFlash extraction tensor is null") failed

exec "$LLAMA_SERVER" \
  -m "$MODEL" \
  -md "/home/raghuboi/Desktop/models/qwen-3.6/Qwen3-1.7B-Q8_0.gguf" \
  --dflash --draft-max 16 \
  --alias "qwen3.6" --host 0.0.0.0 --port 8081 \
  -ngl 99 -np 1 -kvu \
  --fit on --fit-target 1024 --fit-ctx 262144 \
  -c 262144 -cd 4096 \
  --split-mode layer -b 2048 -ub 1024 --main-gpu 1 \
  --mlock --no-mmap --flash-attn on \
  --chat-template-kwargs '{"preserve_thinking":true}' \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --cache-ram 38912 --ctx-checkpoints 128 \
  --cont-batching --jinja \
  --threads 8 --threads-batch 16 \
  --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 \
  --repeat-penalty 1.0 --presence-penalty 0.0 \
  > "$LOG" 2>&1

The DFlash GGUF you referenced is meant for another fork of llama.cpp, not this PR.
See: https://github.com/spiritbuun/buun-llama-cpp

ruixiang63 · 2026-04-27T12:48:44Z

My plan of next steps for this PR:

This PR currently supports llama-cli and llama-server only with n_parallel = 1 (multi-slot batching is not yet implemented). Full functional support will land after the EAGLE3 PR is merged and the unified speculative API (EAGLE3, DFlash, MTP, etc.) is finalized. (This PR only supports DFlash GGUF models converted with the converter included in this PR.)
I'll then rebase this PR onto that API and follow up with further performance optimizations and n_parallel > 1 support.

Current working commands for llama-cli and llama-server, e.g.

# llama-cli
./build/bin/llama-cli \
  -m "${TARGET_MODEL_GGUF}" \
  -md "${DFLASH_MODEL_GGUF}" \
  --dflash -p "Write a quicksort algorithm in Python. Write code only." -n 256 --draft-max 16 \
  -cd 512 -c 512 \
  --temp 0 --top-k 1 --seed 42 -ngl 99 -ngld 99 \
  --jinja -rea off

# llama-server
./build/bin/llama-server \
  -m "${TARGET_MODEL_GGUF}" \
  -md "${DFLASH_MODEL_GGUF}" \
  --dflash --draft-max 16 \
  -c 2048 -cd 512 \
  --temp 0 --top-k 1 --seed 42 \
  -ngl 99 -ngld 99 \
  --jinja -rea off \
  -np 1 \
  --host 0.0.0.0 --port 8088

HH1162 · 2026-04-27T15:05:32Z

Why isn't there any speedup after enabling the --dflash parameter on this branch? Meanwhile, performance drops significantly when I switch to the official parameters. T_T
Is anyone able to help me out?

200 tokens/s, as normal:
/home/xxxx/llar/bin/llama-server
-m "/home/xxxx/models/Qwen3.6-35B-A3B-Q6_K.gguf"
-md "/home/xxxx/models/Qwen3.6-35B-A3B-DFlash-q8_0.gguf"
--dflash
--draft-max 16
--draft-p-min 0.9
--ctx-size 16384
--n-gpu-layers 99
-ngld 99
--host 127.0.0.1
--port 1234
-fa on
-ctk q4_0
-ctv q4_0
--verbose
--samplers "dry;top_k;typ_p;top_p;min_p;xtc;temperature"
--repeat-penalty 1.1
--temp 0
--top-k 1
--no-mmap
--mlock
--kv-unified
--parallel 1
--sleep-idle-seconds -1
--verbose
--batch-size 8192
--ubatch-size 2048
-n 8192
--threads 14
--threads-batch 28
--reasoning-format deepseek
--reasoning-budget 1024
--prio 2
--jinja

Falls back to 40 tokens/s:
/home/xxxx/llar/bin/llama-server
-m "/home/xxxx/models/Qwen3.6-35B-A3B-Q6_K.gguf"
-md "/home/xxxx/models/Qwen3.6-35B-A3B-DFlash-q8_0.gguf"
-ngl 99
-ngld 99
--dflash
--draft-max 16
-c 2048
-cd 512
--temp 0
--top-k 1
--seed 42
--jinja -rea off
-np 1
--host 127.0.0.1
--port 1234

aminya · 2026-04-28T02:15:46Z

I've tried many different patches and configurations over the weekend on my single 3090 setup. I can see no benefit from DFlash. I cannot reproduce any of the claimed speedups in real workflows with Qwen 27B or Qwen 35B.

Originally posted by @aminya in TheTom#103 (comment)

ukrospm · 2026-04-28T04:46:33Z

For me, it crashes after generating 1 token.

set CUDA_VISIBLE_DEVICES=0,1
a:\0_llama_server_d\build\bin\Release\llama-server ^
  -m a:\0_LM_Studio\lmstudio-community\Qwen3.6-27B-GGUF\Qwen3.6-27B-Q4_K_M.gguf ^
  -md a:\0_LM_Studio\lym00\Qwen3.6-27B-DFlash-bf16.gguf ^
  --host 0.0.0.0 ^
  --port 8088 ^
  -ngl 99 ^
  -ngld 99 ^
  --dflash ^
  --draft-max 16 ^
  -c 2048 ^
  -cd 512 ^
  --temp 0 ^
  --top-k 1 ^
  --seed 42 ^
  --jinja -rea off ^
  -np 1 --verbose

done_getting_tensors: tensor 'token_embd.weight' (q4_K) (and 0 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead
.
.
.
srv update_slots: decoding batch, n_tokens = 4
set_adapters_lora: adapters = 0000000000000000
adapters_lora_are_same: adapters = 0000000000000000
set_embeddings: value = 0
srv operator (): http: streamed chunk: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"role":"assistant","content":null}}],"created":1777351446,"id":"chatcmpl-DbepysCBNfjQ2DyiF5glro4x77VGN159","model":"Qwen3.6-27B-Q4_K_M.gguf","system_fingerprint":"b0-unknown","object":"chat.completion.chunk","timings":{"cache_n":0,"prompt_n":9,"prompt_ms":0.0,"prompt_per_token_ms":0.0,"prompt_per_second":null,"predicted_n":0,"predicted_ms":0.0,"predicted_per_token_ms":null,"predicted_per_second":null},"prompt_progress":{"total":13,"cache":0,"processed":9,"time_ms":656}}

extract_dflash_features: Start to extract DFlash features: 5 layers, 4 tokens, 5120 embd
res send: sending result for task id = 0
res send: task id = 0 pushed to result queue
srv operator (): http: streamed chunk: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"role":"assistant","content":null}}],"created":1777351446,"id":"chatcmpl-DbepysCBNfjQ2DyiF5glro4x77VGN159","model":"Qwen3.6-27B-Q4_K_M.gguf","system_fingerprint":"b0-unknown","object":"chat.completion.chunk","timings":{"cache_n":0,"prompt_n":13,"prompt_ms":0.0,"prompt_per_token_ms":0.0,"prompt_per_second":null,"predicted_n":0,"predicted_ms":0.0,"predicted_per_token_ms":null,"predicted_per_second":null},"prompt_progress":{"total":13,"cache":0,"processed":13,"time_ms":728}}

res send: sending result for task id = 0
res send: task id = 0 pushed to result queue
slot process_toke: id 0 | task 0 | n_decoded = 1, n_remaining = -1, next token: 9419 'Hello'
srv update_slots: run slots completed
que start_loop: waiting for new tasks
que start_loop: processing new tasks
que start_loop: processing task, id = 2
que start_loop: update slots
srv update_slots: posting NEXT_RESPONSE
que post: new task, id = 3, front = 0
slot get_n_draft_: id 0 | task 0 | max possible draft: 16
srv operator (): http: streamed chunk: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"role":"assistant","content":null}}],"created":1777351446,"id":"chatcmpl-DbepysCBNfjQ2DyiF5glro4x77VGN159","model":"Qwen3.6-27B-Q4_K_M.gguf","system_fingerprint":"b0-unknown","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"Hello"}}],"created":1777351446,"id":"chatcmpl-DbepysCBNfjQ2DyiF5glro4x77VGN159","model":"Qwen3.6-27B-Q4_K_M.gguf","system_fingerprint":"b0-unknown","object":"chat.completion.chunk","timings":{"cache_n":0,"prompt_n":13,"prompt_ms":731.354,"prompt_per_token_ms":56.258,"prompt_per_second":17.77524974225888,"predicted_n":1,"predicted_ms":0.001,"predicted_per_token_ms":0.001,"predicted_per_second":1000000.0}}


CISC · 2026-06-28T12:44:01Z

+        block_size = self.hparams.get("block_size", 16)
+        self.gguf_writer.add_uint32(f"{self.gguf_writer.arch}.block_size", block_size)
+        dflash_config = self.hparams.get("dflash_config", {})
+
+        target_layer_ids = dflash_config.get("target_layer_ids", [])
+        if target_layer_ids:
+            extract_layer_ids = [i + 1 for i in target_layer_ids]
+            self.gguf_writer.add_array(f"{self.gguf_writer.arch}.target_layers", extract_layer_ids)


Add proper keys and methods for these please!
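To make the effect of the quoted converter lines concrete, here is a minimal standalone sketch of the same logic. The function name `dflash_metadata`, the `"dflash"` architecture string and the example `target_layer_ids` are assumptions for illustration; the ids are chosen only so that the shifted result matches the `extract_layers` values printed in a later log in this thread.

```python
from typing import Any, Dict


def dflash_metadata(hparams: Dict[str, Any], arch: str) -> Dict[str, Any]:
    """Return the key/value pairs the quoted snippet would write (hypothetical helper)."""
    meta: Dict[str, Any] = {}
    # block_size falls back to 16 when the config does not set it.
    meta[f"{arch}.block_size"] = hparams.get("block_size", 16)

    dflash_config = hparams.get("dflash_config", {})
    target_layer_ids = dflash_config.get("target_layer_ids", [])
    if target_layer_ids:
        # Same +1 shift as in the diff: stored ids are one past the config ids.
        meta[f"{arch}.target_layers"] = [i + 1 for i in target_layer_ids]
    return meta


# Hypothetical config values; "dflash" is an assumed architecture name.
example = dflash_metadata(
    {"dflash_config": {"target_layer_ids": [1, 16, 31, 46, 61]}},
    arch="dflash",
)
print(example)
```

Note that an empty or missing `target_layer_ids` writes no `target_layers` key at all, so a reader of the file has to handle the absent-key case.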

CISC · 2026-06-28T13:07:29Z

+        mask_token_id = dflash_config.get("mask_token_id", None)
+        if mask_token_id is not None:
+            self.gguf_writer.add_mask_token_id(mask_token_id)


Suggested change

-        mask_token_id = dflash_config.get("mask_token_id", None)
-        if mask_token_id is not None:
-            self.gguf_writer.add_mask_token_id(mask_token_id)
+        mask_token_id = dflash_config.get("mask_token_id", None)
+        if mask_token_id is not None:
+            self.hparams["mask_token_id"] = mask_token_id

I'm not sure of the purpose of separating the token id like this, but this would have gotten overridden by SpecialVocab later on if there already was a mask_token_id in the config.
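The ordering problem described here can be shown with a toy model. Nothing below is the real gguf-py API: `ToyWriter`, `convert` and the key name are stand-ins, and the sketch simply assumes that a later vocab pass writes whatever top-level `mask_token_id` it finds, so the last write wins. The id 248070 is the value printed in a later log in this thread; 100 is made up.

```python
class ToyWriter:
    """Stand-in for a metadata writer: the last write to a key wins."""

    def __init__(self):
        self.kv = {}

    def add_mask_token_id(self, token_id):
        self.kv["tokenizer.ggml.mask_token_id"] = token_id


def convert(config, use_hparams_route):
    hparams = dict(config)
    writer = ToyWriter()

    dflash_id = config.get("dflash_config", {}).get("mask_token_id")
    if dflash_id is not None:
        if use_hparams_route:
            hparams["mask_token_id"] = dflash_id  # suggested change
        else:
            writer.add_mask_token_id(dflash_id)   # original diff

    # Toy stand-in for the later special-vocab pass (assumed behavior):
    # it writes the top-level id when one is present.
    if hparams.get("mask_token_id") is not None:
        writer.add_mask_token_id(hparams["mask_token_id"])
    return writer.kv


config = {"mask_token_id": 100, "dflash_config": {"mask_token_id": 248070}}
print(convert(config, use_hparams_route=False))  # direct write is overridden
print(convert(config, use_hparams_route=True))   # DFlash id survives
```

Under these assumptions the direct write only survives when the config has no top-level `mask_token_id`, which is the fragile case the review comment points at.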

CISC · 2026-06-28T13:13:04Z

+        if not name.startswith("model."):
+            name = "model." + name


This belongs in filter_tensors and the two above should have gotten renamed and properly mapped in tensor_mapping.

CISC · 2026-06-28T13:15:26Z

Too late, but follow up please!

ruixiang63 · 2026-06-28T13:52:09Z

Too late, but follow up please!

Thanks for the review. Will address these in a follow-up PR.

ulyssesrr · 2026-06-28T19:10:22Z

@ruixiang63 is this CUDA only for now?

Using --device Vulkan1,CUDA0 crashes on first inference:

0.25.864.084 I sched_reserve: reserving ...
0.25.864.105 I srv  stream_sessi: conv_id=5ubuwzeyjnc (empty=0)
0.25.901.236 I sched_reserve:    Vulkan1 compute buffer size =   367.24 MiB
0.25.901.240 I sched_reserve:      CUDA0 compute buffer size =   233.36 MiB
0.25.901.241 I sched_reserve: Vulkan_Host compute buffer size =    53.13 MiB
0.25.901.242 I sched_reserve: graph nodes  = 6007
0.25.901.242 I sched_reserve: graph splits = 3
0.25.901.243 I sched_reserve: reserve took 37.14 ms, sched copies = 4
/git/llama.cpp/ggml/src/ggml-backend.cpp:194: GGML_ASSERT(buffer) failed
...
This GDB supports auto-downloading debuginfo from the following URLs:
  <https://debuginfod.archlinux.org>
Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]
Debuginfod has been disabled.
To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
0x00007fdba8ca0a52 in ?? () from /usr/lib/libc.so.6
#0  0x00007fdba8ca0a52 in ?? () from /usr/lib/libc.so.6
#1  0x00007fdba8c94abc in ?? () from /usr/lib/libc.so.6
#2  0x00007fdba8c94b04 in ?? () from /usr/lib/libc.so.6
#3  0x00007fdba8d05c6f in wait4 () from /usr/lib/libc.so.6
#4  0x0000561b695cbd3b in ggml_print_backtrace ()
#5  0x0000561b695cc298 in ggml_abort ()
#6  0x0000561b695e7535 in ggml_backend_buffer_get_type.part ()
#7  0x0000561b695e86f9 in ggml_backend_buffer_is_host ()
#8  0x0000561b68315273 in llama_kv_cache::set_input_k_rot(ggml_tensor*) const ()
#9  0x0000561b682facdb in llm_graph_input_attn_kv_iswa::set_input(llama_ubatch const*) ()
#10 0x0000561b682ff110 in llm_graph_result::set_inputs(llama_ubatch const*) ()
#11 0x0000561b682ed0bb in llama_context::process_ubatch(llama_ubatch const&, llm_graph_type, llama_memory_context_i*, ggml_status&) ()
#12 0x0000561b682f0c2b in llama_context::decode(llama_batch const&) ()
#13 0x0000561b682f2dee in llama_decode ()
#14 0x0000561b6811ea59 in common_speculative_impl_draft_dflash::process(llama_batch const&) ()
#15 0x0000561b68117784 in common_speculative_process(common_speculative*, llama_batch const&) ()
#16 0x0000561b67f2996f in server_context_impl::decode(int&, int, llama_batch&) ()
#17 0x0000561b67f2bfc8 in server_context_impl::update_slots() ()
#18 0x0000561b67ec0931 in server_queue::start_loop(long) ()
#19 0x0000561b67e5cdcc in llama_server(int, char**) ()
#20 0x00007fdba8c27741 in ?? () from /usr/lib/libc.so.6
#21 0x00007fdba8c27879 in __libc_start_main () from /usr/lib/libc.so.6
#22 0x0000561b67e54d05 in _start ()

Using --device Vulkan1,Vulkan0 still tries to use CUDA and crashes on warmup:

0.15.448.724 I llama_prepare_model_devices: skipping device Vulkan0 (NVIDIA GeForce RTX 4060 Ti) with id 0000:0a:00.0 - already using device CUDA0 (NVIDIA GeForce RTX 4060 Ti) with the same id
0.15.448.856 I llama_prepare_model_devices: using device CUDA0 (NVIDIA GeForce RTX 4060 Ti) (0000:0a:00.0) - 2797 MiB free
0.15.448.963 I llama_prepare_model_devices: using device Vulkan1 (AMD Radeon RX 5700 XT (RADV NAVI10)) (0000:06:00.0) - 1776 MiB free
0.15.449.003 I load_arch_hparams: DFlash extract_layers = [0.15.449.004 I 2, 0.15.449.005 I 17, 0.15.449.005 I 32, 0.15.449.005 I 47, 0.15.449.005 I 620.15.449.006 I ]
...
/git/llama.cpp/ggml/src/ggml-backend.cpp:898: pre-allocated tensor (output.weight) in a buffer (Vulkan0) that cannot run the operation (NONE)
Using host libthread_db library "/usr/lib/libthread_db.so.1".
0x00007f9a1baa0a52 in ?? () from /usr/lib/libc.so.6
#0  0x00007f9a1baa0a52 in ?? () from /usr/lib/libc.so.6
#1  0x00007f9a1ba94abc in ?? () from /usr/lib/libc.so.6
#2  0x00007f9a1ba94b04 in ?? () from /usr/lib/libc.so.6
#3  0x00007f9a1bb05c6f in wait4 () from /usr/lib/libc.so.6
#4  0x000055c4fdf0dd3b in ggml_print_backtrace ()
#5  0x000055c4fdf0e298 in ggml_abort ()
#6  0x000055c4fdf2a312 in ggml_backend_sched_backend_id_from_cur(ggml_backend_sched*, ggml_tensor*) ()
#7  0x000055c4fdf2e92c in ggml_backend_sched_split_graph.part ()
#8  0x000055c4fcc2fa63 in llama_context::graph_reserve(unsigned int, unsigned int, unsigned int, llama_memory_context_i const*, bool, unsigned long*) ()
#9  0x000055c4fcc309e8 in llama_context::sched_reserve() ()
#10 0x000055c4fcc36a8d in llama_context::llama_context(llama_model const&, llama_context_params) ()
#11 0x000055c4fcc37d54 in llama_init_from_model ()
#12 0x000055c4fc87e938 in server_context_impl::load_model(common_params&) ()
#13 0x000055c4fc79eb2b in llama_server(int, char**) ()
#14 0x00007f9a1ba27741 in ?? () from /usr/lib/libc.so.6
#15 0x00007f9a1ba27879 in __libc_start_main () from /usr/lib/libc.so.6
#16 0x000055c4fc796d05 in _start ()

Finally, using --device Vulkan1,Vulkan0 with CUDA_VISIBLE_DEVICES="" produces the same crash (on first inference) as --device Vulkan1,CUDA0:

0.14.085.059 I print_info: file format = GGUF V3 (latest)
0.14.085.059 I print_info: file type   = IQ4_XS - 4.25 bpw
0.14.085.063 I print_info: file size   = 880.71 MiB (4.27 BPW) 
0.14.085.643 I llama_prepare_model_devices: using device Vulkan0 (NVIDIA GeForce RTX 4060 Ti) (0000:0a:00.0) - 2836 MiB free
0.14.085.751 I llama_prepare_model_devices: using device Vulkan1 (AMD Radeon RX 5700 XT (RADV NAVI10)) (0000:06:00.0) - 1776 MiB free
0.14.085.788 I load_arch_hparams: DFlash extract_layers = [0.14.085.790 I 2, 0.14.085.791 I 17, 0.14.085.791 I 32, 0.14.085.791 I 47, 0.14.085.791 I 620.14.085.792 I ]
...
.00.839.017 I sched_reserve: reserving ...
3.00.839.030 I srv  stream_sessi: conv_id=5ubuwzeyjnc (empty=0)
3.00.915.825 I sched_reserve:    Vulkan1 compute buffer size =   367.24 MiB
3.00.915.830 I sched_reserve:    Vulkan0 compute buffer size =   375.36 MiB
3.00.915.831 I sched_reserve: Vulkan_Host compute buffer size =    53.13 MiB
3.00.915.832 I sched_reserve: graph nodes  = 6007
3.00.915.832 I sched_reserve: graph splits = 3
3.00.915.833 I sched_reserve: reserve took 76.80 ms, sched copies = 4
/git/llama.cpp/ggml/src/ggml-backend.cpp:194: GGML_ASSERT(buffer) failed
...
Using host libthread_db library "/usr/lib/libthread_db.so.1".
0x00007f303faa0a52 in ?? () from /usr/lib/libc.so.6
#0  0x00007f303faa0a52 in ?? () from /usr/lib/libc.so.6
#1  0x00007f303fa94abc in ?? () from /usr/lib/libc.so.6
#2  0x00007f303fa94b04 in ?? () from /usr/lib/libc.so.6
#3  0x00007f303fb05c6f in wait4 () from /usr/lib/libc.so.6
#4  0x000055705305ad3b in ggml_print_backtrace ()
#5  0x000055705305b298 in ggml_abort ()
#6  0x0000557053076535 in ggml_backend_buffer_get_type.part ()
#7  0x00005570530776f9 in ggml_backend_buffer_is_host ()
#8  0x0000557051da4273 in llama_kv_cache::set_input_k_rot(ggml_tensor*) const ()
#9  0x0000557051d89cdb in llm_graph_input_attn_kv_iswa::set_input(llama_ubatch const*) ()
#10 0x0000557051d8e110 in llm_graph_result::set_inputs(llama_ubatch const*) ()
#11 0x0000557051d7c0bb in llama_context::process_ubatch(llama_ubatch const&, llm_graph_type, llama_memory_context_i*, ggml_status&) ()
#12 0x0000557051d7fc2b in llama_context::decode(llama_batch const&) ()
#13 0x0000557051d81dee in llama_decode ()
#14 0x0000557051bada59 in common_speculative_impl_draft_dflash::process(llama_batch const&) ()
#15 0x0000557051ba6784 in common_speculative_process(common_speculative*, llama_batch const&) ()
#16 0x00005570519b896f in server_context_impl::decode(int&, int, llama_batch&) ()
#17 0x00005570519bafc8 in server_context_impl::update_slots() ()
#18 0x000055705194f931 in server_queue::start_loop(long) ()
#19 0x00005570518ebdcc in llama_server(int, char**) ()
#20 0x00007f303fa27741 in ?? () from /usr/lib/libc.so.6
#21 0x00007f303fa27879 in __libc_start_main () from /usr/lib/libc.so.6
#22 0x00005570518e3d05 in _start ()

Full command:

/git/llama.cpp/build/bin/llama-server \
    --log-verbosity 4 \
    --model "$MODEL_DIR/Qwen3.6-27B-IQ4_XS.gguf" \
    --model-draft "$MODEL_DIR/Qwen3.6-27B-DFlash-IQ4_XS.gguf" \
    --spec-type draft-dflash \
    --spec-draft-n-max 15 \
    --cache-type-k-draft q8_0 \
    --cache-type-v-draft q8_0 \
    --chat-template-kwargs '{"preserve_thinking": true}' \
    --parallel 1 \
    --device Vulkan1,CUDA0 \
    --no-mmap \
    --mlock \
    --fit on \
    --fit-target 1,640 \
    --ctx-size 65536 \
    --flash-attn 1 \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 \
    --threads 12 \
    --temp 0.6 \
    --top-p 0.95 \
    --top-k 20 \
    --min-p 0.00 \
    --repeat-penalty 1.0 \
    --host 0.0.0.0 \
    --seed 42 \
    --port 51000

AbdulrahmanHashem · 2026-06-28T21:07:58Z

@ruixiang63 could you link me to a Q4_K_M quantized draft model I can use, if there is one? Thank you very much for the work ^_^

AbdulrahmanHashem · 2026-06-28T21:18:51Z

because i'm getting

0.00.853.257 E llama_model_load: error loading model: unknown model architecture: 'dflash-draft'
0.00.853.262 E llama_model_load_from_file_impl: failed to load model

with
https://huggingface.co/spiritbuun/Qwen3.6-27B-DFlash-GGUF
and
https://huggingface.co/Anbeeld/Qwen3.6-27B-DFlash-GGUF

ulyssesrr · 2026-06-29T05:16:45Z

because i'm getting
0.00.853.257 E llama_model_load: error loading model: unknown model architecture: 'dflash-draft'
0.00.853.262 E llama_model_load_from_file_impl: failed to load model
with https://huggingface.co/spiritbuun/Qwen3.6-27B-DFlash-GGUF and https://huggingface.co/Anbeeld/Qwen3.6-27B-DFlash-GGUF

The unknown model architecture: 'dflash-draft' error means you are using a DFlash model meant for other llama.cpp forks. For this repo, try this one:
https://huggingface.co/jojohai/Qwen3.6-27B-DFlash-GGUF
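One way to check which architecture string a draft GGUF carries before loading it is to read the `general.architecture` key from the file header. The sketch below is a minimal standalone reader that assumes the GGUF v2/v3 little-endian layout (magic, version, tensor count, key/value count, then key/value pairs); the function names are illustrative, and the official gguf-py package offers a fuller reader.

```python
import struct
from typing import BinaryIO, Optional

# GGUF scalar value type ids mapped to struct formats (little-endian).
_SCALARS = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
            6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
_STRING, _ARRAY = 8, 9


def _read(f: BinaryIO, fmt: str):
    size = struct.calcsize(fmt)
    data = f.read(size)
    if len(data) != size:
        raise ValueError("unexpected end of file")
    return struct.unpack(fmt, data)[0]


def _read_string(f: BinaryIO) -> str:
    length = _read(f, "<Q")  # strings are length-prefixed with a uint64
    return f.read(length).decode("utf-8")


def _read_value(f: BinaryIO, vtype: int):
    if vtype in _SCALARS:
        return _read(f, _SCALARS[vtype])
    if vtype == _STRING:
        return _read_string(f)
    if vtype == _ARRAY:
        item_type = _read(f, "<I")
        count = _read(f, "<Q")
        return [_read_value(f, item_type) for _ in range(count)]
    raise ValueError(f"unknown GGUF value type {vtype}")


def read_architecture(f: BinaryIO) -> Optional[str]:
    """Return the general.architecture string, or None if the key is absent."""
    if f.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    _read(f, "<I")             # version (v2/v3 layout assumed)
    _read(f, "<Q")             # tensor count, unused here
    kv_count = _read(f, "<Q")
    for _ in range(kv_count):
        key = _read_string(f)
        value = _read_value(f, _read(f, "<I"))
        if key == "general.architecture":
            return value
    return None

# Usage (path is a placeholder):
#   with open("draft.gguf", "rb") as fh:
#       print(read_architecture(fh))
```

A file that reports `dflash-draft` here would hit the "unknown model architecture" error quoted above on this branch.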

gelim · 2026-06-30T15:33:59Z

Hello,

I was looking for people comparing MTP performance vs. DFlash on Qwen3.6-27B but found nothing, so I'm commenting here. I am testing on 2 GPUs.
I tested with unsloth/Qwen3.6-27B Q8_0 and the Q8_0 and IQ4_XS drafts from williamliao/qwen3.6-27B-DFlash-GGUF.

For my setup, MTP still outperforms DFlash on code generation (my main use-case).

| Metric | MTP | DFlash Q8 | DFlash IQ4_XS | Base ref (no spec.) |
|---|---|---|---|---|
| Draft acceptance | 0.87 | 0.73 | 0.73 | N/A |
| Mean len | 2.74 | 3.22 | 3.20 | N/A |
| TG 1024 (d=0) | 46 tok/sec | 38 tok/sec | 38 tok/sec | 23 tok/sec |

Interestingly, draft acceptance is the same for Q8 and IQ4_XS. I was expecting performance to be better on IQ4_XS, but TG speed is the same.

And for people with one GPU only (24GB and 131k context max, thanks to KV cache in Q8), testing on unsloth Q4_K_M with IQ4_XS DFlash draft:

| Metric | MTP | DFlash IQ4_XS | Base ref (no spec.) |
|---|---|---|---|
| VRAM usage | 22.2 GB | 23.3 GB | 21 GB |
| Draft acceptance | 0.9 | 0.81 | N/A |
| Mean len | 2.81 | 3.44 | N/A |
| TG 1024 (d=0) | 50.7 tok/sec | 49.5 tok/sec | 32 tok/sec |
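For easier comparison, the TG figures in the two tables can be turned into speedups over the non-speculative baseline. The small sketch below only redoes that arithmetic from the reported numbers; the helper name `speedup` and the labels are illustrative.

```python
def speedup(tok_per_sec: float, baseline: float) -> float:
    """Ratio of speculative throughput to the no-speculation baseline."""
    return tok_per_sec / baseline


# TG 1024 (d=0) values copied from the two tables above, in tok/sec.
runs = [
    ("2 GPUs, Q8_0 target", 23.0,
     {"MTP": 46.0, "DFlash Q8": 38.0, "DFlash IQ4_XS": 38.0}),
    ("1 GPU, Q4_K_M target", 32.0,
     {"MTP": 50.7, "DFlash IQ4_XS": 49.5}),
]

for label, baseline, results in runs:
    for name, tps in results.items():
        print(f"{label}: {name} = {speedup(tps, baseline):.2f}x")
```

On these numbers MTP comes out at about 2.00x and DFlash at about 1.65x on the two-GPU setup, while on the single GPU the two are close (about 1.58x vs. 1.55x).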

Test command (to be adapted for Q4):

MODEL=unsloth_Qwen3.6-27B-Q8_0.gguf
DMODEL=Qwen3.6-27B-DFlash-Q8_0.gguf
CUDA_VISIBLE_DEVICES=0,1 llama-server --no-mmap \
        --top-k 20 --top-p 0.95 --min-p 0.0 --temp 0.0 --reasoning off \
        -m $MODEL \
        -md $DMODEL \
        --spec-type draft-dflash  -ngld 99 -np 1 \
        --cache-type-k q8_0 --cache-type-v q8_0 \
        -ngl 99 -ngld 99 -fa on --jinja -c 131072

Prompt: "write a pi estimator using different methods in python. Code only."
This yields ~1k tokens.

NB: for the DFlash tests, running the main model in its MTP or non-MTP flavor does not change performance. I guess it just takes more memory for the MTP heads, which are not used anyway.

b1skit · 2026-06-30T18:49:51Z

@gelim Interesting. How many tokens were you trying to predict in both cases?

gelim · 2026-07-01T05:51:03Z

@gelim Interesting. How many tokens were you trying to predict in both cases?

MTP run had --spec-draft-n-max 2 --spec-draft-p-min 0.0
DFlash has no configuration knobs.

nipeone · 2026-07-01T07:53:53Z

Can DFlash and mmproj be used together? For example: llama-server -m Qwen3.5-9B-UD-Q5_K_XL.gguf -md Qwen3.5-9B-DFlash-q8_0.gguf -ngl 99 -c 16384 -fa on -np 1 --no-mmap --temp 0.1 --top-p 0.9 --min-p 0.05 --repeat-penalty 1.05 --no-cache-prompt --spec-type draft-dflash --spec-draft-n-max 15 --port 9090 --host 0.0.0.0 --image-min-tokens 1024 --alias Qwen3.5-9B
I got this error: E llama_init_from_model: failed to initialize the context: dflash requires ctx_other to be set (this warning is normal during memory fitting)

* spec: add DFlash v2 support
* dflash: support sliding window attention per layer_types
* docs: add dflash section

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

gelim · 2026-07-01T08:30:05Z

Can DFlash and mmproj be used together? For example: llama-server -m Qwen3.5-9B-UD-Q5_K_XL.gguf -md Qwen3.5-9B-DFlash-q8_0.gguf -ngl 99 -c 16384 -fa on -np 1 --no-mmap --temp 0.1 --top-p 0.9 --min-p 0.05 --repeat-penalty 1.05 --no-cache-prompt --spec-type draft-dflash --spec-draft-n-max 15 --port 9090 --host 0.0.0.0 --image-min-tokens 1024 --alias Qwen3.5-9B I got this error: E llama_init_from_model: failed to initialize the context: dflash requires ctx_other to be set (this warning is normal during memory fitting)

Yes, this works fine. I tested by adding --mmproj mmproj-F16.gguf to my previously documented llama-server command, then tried it in the built-in web UI with a picture attached.

$ llama-server --version
version: 9851 (0eca4d490)
built with GNU 11.4.0 for Linux x86_64

En3Tho · 2026-07-01T08:41:46Z

@gelim What backend are you using? I'm using Vulkan + 2x 9070XT, and with DFlash VRAM usage goes through the roof: around 5 GB more VRAM, which spills into shared memory.

gelim · 2026-07-01T10:37:07Z

CUDA on Tesla V100 32 GB cards.

nipeone · 2026-07-02T14:17:23Z

There are still some problems:

llama-server --version
version: 9859 (4fc4ec554)
built with GNU 13.3.0 for Linux x86_64

When I use this script:

llama-server \
    -m  Qwen3.6-27B-Q4_K_M.gguf \
    -md Qwen3.6-27B-DFlash-b16.gguf \
    --spec-type draft-dflash \
    --spec-draft-n-max 7 \
    --temp 0 \
    --top-k 1 \
    -np 1 \
    -c 10240 \
    --host 0.0.0.0 \
    --port 9090 \
    --alias Qwen3.6-27B \
    -ngl 99 \
    -fa on

I get this log:

0.00.075.271 I cmn  common_param: common_params_print_info: verbosity = 3 (adjust with the `-lv N` CLI arg)
0.00.168.542 I srv    load_model: loading model 'Qwen3.6-27B-Q4_K_M.gguf'
0.00.461.531 E llama_init_from_model: failed to initialize the context: dflash requires ctx_other to be set (this warning is normal during memory fitting)
0.00.507.649 W srv    load_model: [spec] failed to measure draft model memory: failed to create llama_context from model
0.07.911.111 I srv    load_model: initializing, n_slots = 1, n_ctx_slot = 10240, kv_unified = 'false'
0.07.911.142 I common_speculative_impl_draft_dflash: adding speculative implementation 'draft-dflash'
0.07.911.145 I common_speculative_impl_draft_dflash: - n_max=7, n_min=0, p_min=0.00
0.07.911.145 I common_speculative_impl_draft_dflash: - block_size=16, mask_token_id=248070, n_extract=5
0.07.947.129 I srv          init: chat template supports preserving reasoning, consider enabling it via --reasoning-preserve
0.07.947.156 I srv  llama_server: model loaded
0.07.947.159 I srv  llama_server: listening on http://0.0.0.0:9090
0.29.276.782 I slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = -1
0.29.276.839 I slot launch_slot_: id  0 | task 0 | processing task, is_child = 0
0.32.631.781 I slot print_timing: id  0 | task 0 | n_decoded =    243, tg =  80.08 t/s, tg_3s =  80.08 t/s
0.35.657.438 I slot print_timing: id  0 | task 0 | n_decoded =    446, tg =  73.60 t/s, tg_3s =  67.09 t/s
0.38.689.574 I slot print_timing: id  0 | task 0 | n_decoded =    591, tg =  65.00 t/s, tg_3s =  47.82 t/s
0.41.711.604 I slot print_timing: id  0 | task 0 | n_decoded =    715, tg =  59.02 t/s, tg_3s =  41.03 t/s
0.44.753.392 I slot print_timing: id  0 | task 0 | n_decoded =    832, tg =  54.90 t/s, tg_3s =  38.46 t/s
0.47.776.519 I slot print_timing: id  0 | task 0 | n_decoded =    950, tg =  52.26 t/s, tg_3s =  39.03 t/s
0.50.815.488 I slot print_timing: id  0 | task 0 | n_decoded =   1070, tg =  50.43 t/s, tg_3s =  39.49 t/s
0.53.832.416 I slot print_timing: id  0 | task 0 | n_decoded =   1279, tg =  52.77 t/s, tg_3s =  69.28 t/s
0.56.852.714 I slot print_timing: id  0 | task 0 | n_decoded =   1652, tg =  60.61 t/s, tg_3s = 123.50 t/s
0.59.859.107 I slot print_timing: id  0 | task 0 | n_decoded =   1940, tg =  64.11 t/s, tg_3s =  95.80 t/s
1.00.035.253 I slot print_timing: id  0 | task 0 | prompt eval time =     320.58 ms /    18 tokens (   17.81 ms per token,    56.15 tokens per second)
1.00.035.256 I slot print_timing: id  0 | task 0 |        eval time =   30437.80 ms /  1945 tokens (   15.65 ms per token,    63.90 tokens per second)
1.00.035.257 I slot print_timing: id  0 | task 0 |       total time =   30758.38 ms /  1963 tokens
1.00.035.260 I slot print_timing: id  0 | task 0 |    graphs reused =        679
1.00.035.264 I slot print_timing: id  0 | task 0 | draft acceptance = 0.26239 ( 1260 accepted /  4802 generated), mean len =  2.84
1.00.035.300 I slot      release: id  0 | task 0 | stop processing: n_tokens = 1964, truncated = 0

The draft acceptance is only 0.26239, and this error is logged at startup: E llama_init_from_model: failed to initialize the context: dflash requires ctx_other to be set (this warning is normal during memory fitting)
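
For anyone comparing runs, the acceptance figure can be pulled out of the timing line and recomputed from the raw counts. This is only a minimal sketch: the regular expression and the helper name `parse_acceptance` are hypothetical, written against the single log line above rather than against any documented log format, so they may need adjusting for other builds.

```python
import re

# Timing line copied verbatim from the log above.
LINE = ("1.00.035.264 I slot print_timing: id  0 | task 0 | draft acceptance = 0.26239 "
        "( 1260 accepted /  4802 generated), mean len =  2.84")

# Hypothetical pattern, derived from this one line only (not a documented format).
PATTERN = re.compile(
    r"draft acceptance\s*=\s*([\d.]+)\s*\(\s*(\d+)\s+accepted\s*/\s*(\d+)\s+generated\)"
)

def parse_acceptance(line):
    """Return (reported_rate, accepted, generated), or None if the line does not match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    return float(m.group(1)), int(m.group(2)), int(m.group(3))

reported, accepted, generated = parse_acceptance(LINE)
recomputed = accepted / generated  # fraction of drafted tokens the target accepted
print(f"reported={reported:.5f} recomputed={recomputed:.5f} rejected={generated - accepted}")
# prints: reported=0.26239 recomputed=0.26239 rejected=3542
```

In this run 3542 of the 4802 drafted tokens were rejected, so most of the draft work was discarded.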

When I use this script:

llama-server \
    -m  Qwen3.6-27B-Q4_K_M.gguf \
    -mm mmproj-BF16.gguf \
    -md Qwen3.6-27B-DFlash-b16.gguf \
    --spec-type draft-dflash \
    --spec-draft-n-max 7 \
    --temp 0 \
    --top-k 1 \
    -np 1 \
    -c 10240 \
    --host 0.0.0.0 \
    --port 9090 \
    --alias Qwen3.6-27B \
    -ngl 99 \
    -fa on

I get this log with some errors:

0.00.075.313 I cmn  common_param: common_params_print_info: verbosity = 3 (adjust with the `-lv N` CLI arg)
0.00.234.132 I srv    load_model: loading model 'Qwen3.6-27B-Q4_K_M.gguf'
0.00.558.158 E llama_init_from_model: failed to initialize the context: dflash requires ctx_other to be set (this warning is normal during memory fitting)
0.00.600.804 W srv    load_model: [spec] failed to measure draft model memory: failed to create llama_context from model
0.08.044.149 W load_hparams: Qwen-VL models require at minimum 1024 image tokens to function correctly on grounding tasks
0.08.044.152 W load_hparams: if you encounter problems with accuracy, try adding --image-min-tokens 1024
0.08.044.152 W load_hparams: more info: https://github.com/ggml-org/llama.cpp/issues/16842

0.08.462.134 I srv    load_model: loaded multimodal model, 'mmproj-BF16.gguf'
0.08.495.743 I srv    load_model: initializing, n_slots = 1, n_ctx_slot = 10240, kv_unified = 'false'
0.08.495.783 I common_speculative_impl_draft_dflash: adding speculative implementation 'draft-dflash'
0.08.495.787 I common_speculative_impl_draft_dflash: - n_max=7, n_min=0, p_min=0.00
0.08.495.787 I common_speculative_impl_draft_dflash: - block_size=16, mask_token_id=248070, n_extract=5
0.08.529.704 I srv          init: chat template supports preserving reasoning, consider enabling it via --reasoning-preserve
0.08.529.733 I srv  llama_server: model loaded
0.08.529.736 I srv  llama_server: listening on http://0.0.0.0:9090
0.52.682.801 I slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = -1
0.52.682.860 I slot launch_slot_: id  0 | task 0 | processing task, is_child = 0
0.53.107.345 W find_slot: non-consecutive token position 8 after 7 for sequence 0 with 512 new tokens
0.53.107.348 W find_slot: non-consecutive token position 8 after 8 for sequence 0 with 512 new tokens
0.53.107.349 W find_slot: non-consecutive token position 8 after 8 for sequence 0 with 68 new tokens
0.53.107.652 W find_slot: non-consecutive token position 8 after 7 for sequence 0 with 512 new tokens
0.53.227.025 W find_slot: non-consecutive token position 8 after 8 for sequence 0 with 512 new tokens
0.53.401.721 W find_slot: non-consecutive token position 8 after 8 for sequence 0 with 68 new tokens
0.53.500.326 W find_slot: non-consecutive token position 50 after 8 for sequence 0 with 4 new tokens
0.53.500.343 W find_slot: non-consecutive token position 50 after 8 for sequence 0 with 4 new tokens
0.53.538.948 E init: the tokens of sequence 0 in the input batch have inconsistent sequence positions:
 - the last position stored in the memory module of the context (i.e. the KV cache) for sequence 0 is X = 7
 - the tokens for sequence 0 in the input batch have a starting position of Y = 47
 it is required that the sequence positions remain consecutive: Y = X + 1
0.53.538.951 E decode: failed to initialize batch
0.53.538.952 E llama_decode: failed to decode, ret = -1
0.53.538.952 E process: llama_decode(ctx_dft) failed rc=-1 (n_tokens=4, offset=0)
0.53.538.955 E srv        decode: failed to process speculative batch
0.53.538.978 E srv  update_slots: decode() failed: failed to process speculative batch
0.53.538.983 E srv    send_error: task id = 0, error: decode() failed: failed to process speculative batch
0.53.538.986 I slot      release: id  0 | task 0 | stop processing: n_tokens = 1104, truncated = 0
0.53.539.031 W srv          stop: cancel task, id_task = 0

The only difference from the first script is that I added -mm mmproj-BF16.gguf.
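
The decode failure above comes from the rule quoted in the log: the first position in a batch (Y) must directly follow the last position stored in the KV cache (X), i.e. Y = X + 1. The sketch below only illustrates that rule. `check_batch_positions` is a hypothetical helper, not llama.cpp code, the real check handles more cases, and the batch positions 47..50 are an assumption based on the reported Y = 47 and n_tokens=4.

```python
def check_batch_positions(last_pos_in_cache, batch_positions):
    """Return None if the batch continues the sequence, else a short error string.

    Illustrative only: mirrors the "Y = X + 1" rule quoted in the log.
    """
    if not batch_positions:
        return None
    expected = last_pos_in_cache + 1
    start = batch_positions[0]
    if start != expected:
        return f"gap: X = {last_pos_in_cache}, Y = {start}, expected Y = {expected}"
    # Positions inside the batch must also increase by exactly one.
    for prev, cur in zip(batch_positions, batch_positions[1:]):
        if cur != prev + 1:
            return f"non-consecutive position {cur} after {prev}"
    return None

# Values from the failing draft decode: X = 7, batch of 4 tokens starting at 47 (assumed 47..50).
print(check_batch_positions(7, [47, 48, 49, 50]))  # gap: X = 7, Y = 47, expected Y = 8
# A batch that continues directly after position 7 passes the check.
print(check_batch_positions(7, [8, 9, 10, 11]))    # None
```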

mmproj-BF16.gguf
Qwen3.6-27B-DFlash-b16.gguf
Qwen3.6-27B-DFlash-q8_0.gguf
Qwen3.6-27B-Q4_K_M.gguf
Qwen3.6-27B-UD-Q4_K_XL.gguf
Qwen3.6-27B-UD-Q5_K_XL.gguf
server1.sh

The DFlash GGUFs were produced with the convert_hf_to_gguf.py script.
The Q4, UD-Q4, and UD-Q5 GGUFs are from unsloth.

gelim · 2026-07-02T15:51:22Z

@nipeone you can try to reproduce my tests using the same GGUF (williamliao/qwen3.6-27B-DFlash-GGUF for the drafter)

The "dflash requires ctx_other" warning goes away if you don't need memory fitting and turn it off with -fit off.

Side note: when I switch the drafter to the F16 version, I get 0.77 draft acceptance and 47 tok/sec for my coding use case.
Personally, I use the same quant for the full model and its drafter.

* spec: add DFlash v2 support
* dflash: support sliding window attention per layer_types
* docs: add dflash section

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

JamesPhlao · 2026-07-03T06:02:38Z

> CUDA on Tesla V100 32gb cards

@gelim Can a single Tesla V100 32 GB card work well?

gelim · 2026-07-03T06:18:58Z

> > CUDA on Tesla V100 32gb cards
>
> @gelim can single Tesla V100 32gb work well?

You can judge from the second table I posted in a previous comment in this thread. I'm fine with it performance-wise. It could of course always be faster, but for peer coding it's fine.

If the question is whether you should buy those old cards for local LLM use: I would not. Their CUDA compute capability is old (7.0).

github-actions Bot added model Model specific examples python python script changes server labels Apr 19, 2026

ruixiang63 mentioned this pull request Apr 19, 2026

[Speculative decoding] feat: add EAGLE3 speculative decoding support #18039

Merged

ruixiang63 mentioned this pull request Apr 19, 2026

Feature Request: DFLASH support (from 40 tok/sec to 400 tok/sec) #21978

Closed

4 tasks

noonghunna mentioned this pull request Apr 20, 2026

[Bug]: Gemma-4 + DFlash unservable on Ampere — non-causal + head_dim=256 has no compatible attention backend vllm-project/vllm#40382

Open

aminya mentioned this pull request Apr 23, 2026

Turbo dflash TheTom/llama-cpp-turboquant#103

Draft

ruixiang63 mentioned this pull request Apr 23, 2026

Llama cpp support z-lab/dflash#50

Open

ruixiang63 force-pushed the dflash branch from d1d2c81 to e344c4a Compare April 24, 2026 18:01

ypomortsev reviewed Apr 24, 2026


ggerganov merged commit d1b3425 into ggml-org:master Jun 28, 2026
27 checks passed

CISC reviewed Jun 28, 2026

This was referenced Jun 28, 2026

dflash: fix VL/multimodal crash by clearing draft on position gap ruixiang63/llama.cpp#7

Closed

dflash: refactor draft model conversion #25110

Merged

waldirsp11 mentioned this pull request Jun 29, 2026

Eval bug: QWEN3.6 27b DFlash draft model fails to load #25116

Closed

jerrydong1988 mentioned this pull request Jun 29, 2026

DFlash performance regression on AMD APU + quantized MoE target: ~2x slower than baseline #25117

Open

github-actions Bot mentioned this pull request Jun 29, 2026

Reddit News Daily 2026-06-29 gitlawr/reddit-daily-news#290

Open

bernardladenthin mentioned this pull request Jun 29, 2026

Upgrade llama.cpp from b9829 to b9840 bernardladenthin/java-llama.cpp#280

Merged

6 tasks

oobabooga mentioned this pull request Jun 29, 2026

Studio: add DFlash speculative decoding support unslothai/unsloth#6747

Open

wjinxu mentioned this pull request Jun 30, 2026

spec: add DSpark speculative decoding #25173

Open

TheTom mentioned this pull request Jul 1, 2026

eval: cherry-pick upstream DFlash spec-decode (#22105 + #25110, dep #24707) — CI check TheTom/llama-cpp-turboquant#201

Draft
