DFlash performance regression on AMD APU + quantized MoE target: ~2x slower than baseline #25117

@jerrydong1988

DFlash performance regression on AMD APU + quantized MoE target: ~2x slower than baseline (no speculation)

Environment

Component | Details
--- | ---
Platform | Windows 11 24H2, AMD Strix Halo (AI MAX+ 395 w/ Radeon 8060S)
GPU VRAM | 110,456 MiB (shared / UMA)
System RAM | 32 GB
ROCm SDK | 7.13.0 (tech preview, pip-installed rocm-sdk-devel==7.13.0)
Compiler | AMD Clang 23.0.0git (HIP 7.13)
llama.cpp | b9832 (f68a788), built with GGML_HIP=ON, GGML_RPC=ON
Backend | ROCm (HIP) on gfx1151 (RDNA 3.5)

Models

Target model

  • Source: unsloth/Qwen3.5-122B-A10B-GGUF
  • Architecture: Qwen3.5 MoE (122B total / ~10B active per token)
  • Quantization: Unsloth Dynamic UD-Q4_K_XL (per-layer adaptive quantization)
  • Size: ~71.7 GB (3 split GGUF files)
  • Multimodal projector: mmproj-F16.gguf (~0.85 GB)

DFlash draft model

  • Source: z-lab/Qwen3.5-122B-A10B-DFlash (HuggingFace, original weights are BF16)
  • Architecture: DFlashDraftModel (diffusion-style block draft, 6 transformer layers)
  • Conversion: Converted by us via convert_hf_to_gguf.py with --outtype f16
  • --target-model-dir: Qwen/Qwen3.5-122B-A10B (for tokenizer extraction)
  • GGUF size: ~1.45 GB

MTP draft model (for comparison)

The MTP variant of the same base model (from unsloth), run with --spec-type draft-mtp, achieves ~20 t/s on the same hardware.

Reproduction

Launch command (DFlash)

Running from the model directory with relative paths:

cd $MODEL_DIR
llama-server.exe `
  -m "Qwen3.5-122B-A10B-UD-Q4_K_XL-00001-of-00003.gguf" `
  --mmproj "mmproj-F16.gguf" `
  --jinja -ngl 99 -b 4096 -ub 2048 -np 1 `
  --fit off --kv-unified `
  -md "Qwen3.5-122B-A10B-DFlash.gguf" `
  --spec-draft-n-max 7 --spec-type draft-dflash `
  --host 127.0.0.1 --port 8080 `
  -n -1 --temp 0.6 --top-k 20 --top-p 0.7 `
  --repeat-penalty 1 --min-p 0.05 `
  -to 3600 --metrics --props --prefill-assistant

Launch command (Baseline, no speculation)

Same command without -md, --spec-draft-n-max, --spec-type.

Launch command (MTP, for comparison)

Same command but with --spec-type draft-mtp (using the MTP variant of the same base model).

Results

Mode | Generation speed | Speedup vs baseline
--- | --- | ---
No speculation (baseline) | 19.5 t/s | 1.00x
MTP (built-in MTP heads) | ~20 t/s | ~1.03x
DFlash (draft-dflash) | ~9.4 t/s | 0.48x
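
For reference, the speedup column is just each mode's throughput divided by the baseline throughput. A minimal sketch of that arithmetic, using the rounded figures from the table above (the variable and function names are illustrative only):

```python
# Throughput figures copied from the results table (tokens per second, rounded).
baseline_tps = 19.5
measured = {"MTP": 20.0, "DFlash": 9.4}

def speedup(tps: float, baseline: float) -> float:
    """Ratio of a mode's throughput to the no-speculation baseline."""
    return tps / baseline

for mode, tps in measured.items():
    ratio = speedup(tps, baseline_tps)
    # 1 / ratio is the relative time spent per generated token.
    print(f"{mode}: {ratio:.2f}x baseline throughput, {1 / ratio:.2f}x time per token")
```

A ratio below 1.0 means the mode is slower than plain decoding, which is the case for DFlash here.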

Server logs (DFlash):

I common_speculative_impl_draft_dflash: adding speculative implementation 'draft-dflash'
I common_speculative_impl_draft_dflash: - n_max=7, n_min=0, p_min=0.00
I common_speculative_impl_draft_dflash: - block_size=16, mask_token_id=248077, n_extract=8

Metrics endpoint:

llamacpp:predicted_tokens_seconds 9.44
llamacpp:n_decode_total 7291
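
To make the counter easier to compare across runs, the metrics text can be parsed and the decode count divided by the number of generated tokens. The sketch below works on the two lines quoted above. It assumes the simple `<name> <value>` form shown here (no labels or timestamps), and the token count of 100 is the approximate figure from this report, not something read from the endpoint.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse Prometheus text-format lines of the simple form '<name> <value>'."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        name, _, value = line.rpartition(" ")
        try:
            metrics[name] = float(value)
        except ValueError:
            continue  # ignore lines that do not end in a number
    return metrics

# The two lines quoted from the metrics endpoint in this report.
sample = """\
llamacpp:predicted_tokens_seconds 9.44
llamacpp:n_decode_total 7291
"""

m = parse_metrics(sample)
generated_tokens = 100  # assumption: approximate count stated in this report
calls_per_token = m["llamacpp:n_decode_total"] / generated_tokens
print(f"{calls_per_token:.1f} decode calls per generated token")
```

Running the same calculation against a baseline (no speculation) session would show how much of the decode count is attributable to the DFlash path.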

The slot also confirms speculative mode is active:

"speculative": true,
"speculative.types": "none,draft-dflash"

Observations

  1. DFlash is roughly 2x slower than running without any speculation (~9.4 t/s vs 19.5 t/s). This is unexpected: the DFlash paper claims up to 4.21x speedup on Qwen3.5 models.

  2. The n_decode_total counter shows 7291 decode calls for only ~100 generated tokens, i.e. roughly 73 decode calls per token. This is extremely high and suggests massive overhead from DFlash's process() encoder/injection calls for every committed token.

  3. The base model is quantized (UD-Q4_K_XL) but the DFlash draft model is F16. DFlash's encoder extracts intermediate layer embeddings from the target model. When the target is dynamically quantized (Unsloth Dynamic), the intermediate activation distributions may differ significantly from the BF16 version the draft model was trained with. This likely causes poor draft quality and low acceptance rate, making the draft overhead a net loss.

  4. The target model is MoE (122B total, 10B active). PR #22105 ("[Speculative decoding] feat: add DFlash support") explicitly notes that MoE targets see smaller speedups because parallel verification activates more experts than single-token decoding, increasing computation per verify step. For gpt-oss-20b MoE, the PR reports DFlash speedup as low as 0.61x.

  5. ROCm (HIP) is a newly supported backend for DFlash. The implementation was just merged (June 28), so HIP-specific performance tuning (e.g., an equivalent of CUDA graphs, warp-level optimizations) may not yet be in place.
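
To illustrate why a low acceptance rate turns speculation into a net loss, here is a rough sketch of the usual speculative-decoding speedup estimate. It assumes each drafted token is accepted independently with probability `alpha`, and expresses costs in units of one plain target decode step. The cost values below are hypothetical placeholders, not measurements from this setup. Only the draft length of 7 comes from the launch command.

```python
def expected_tokens_per_cycle(alpha: float, n_draft: int) -> float:
    """Expected tokens committed per draft+verify cycle.

    Assumes each drafted token is accepted independently with probability
    alpha, and that the verify pass always yields one extra target token.
    """
    if alpha >= 1.0:
        return float(n_draft + 1)
    return (1.0 - alpha ** (n_draft + 1)) / (1.0 - alpha)

def estimated_speedup(alpha: float, n_draft: int,
                      verify_cost: float, draft_cost: float) -> float:
    """Speedup over plain decoding; costs are in units of one target decode step."""
    return expected_tokens_per_cycle(alpha, n_draft) / (verify_cost + draft_cost)

N_DRAFT = 7          # from --spec-draft-n-max 7
VERIFY_COST = 1.5    # hypothetical: batched verify vs. one single-token decode
DRAFT_COST = 0.5     # hypothetical: draft pass plus feature injection

for alpha in (0.0, 0.3, 0.6, 0.9):
    s = estimated_speedup(alpha, N_DRAFT, VERIFY_COST, DRAFT_COST)
    print(f"alpha={alpha:.1f}: estimated speedup {s:.2f}x")
```

In this model, speculation only pays off when the expected number of committed tokens per cycle exceeds the combined draft and verify cost, so measuring the actual acceptance rate would help separate cause 1 from the overhead-related causes.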

Potential causes

  1. Quantization mismatch: DFlash encoder expects BF16 target activations, but the target is UD-Q4_K_XL. This likely corrupts the feature extraction -> poor draft quality -> near-zero acceptance -> all overhead, no gain.

  2. MoE parallel-verification overhead: DFlash verifies a block of N tokens at once. For MoE, this means up to N*K experts activated simultaneously vs K for single-token decode. On APU shared memory, this may cause bandwidth contention.

  3. Feature injection overhead per token: DFlash's process() method extracts 8 layer embeddings from the target model and injects them into the draft model's KV cache for every new token. Each injection adds a llama_decode call, significantly inflating the total decode count.

  4. HIP backend immaturity for DFlash: The draft model's small 6-layer decoder graph and non-causal attention pattern may not be well-optimized on ROCm yet.
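
As a rough way to size cause 2, the sketch below estimates how many distinct experts per MoE layer a verify batch touches. It assumes each token routes to `top_k` of `n_experts` experts uniformly and independently. Real routers are correlated, so this is only a ballpark. The expert counts used are hypothetical placeholders, since the report does not state the model's actual routing configuration.

```python
def expected_distinct_experts(n_experts: int, top_k: int, n_tokens: int) -> float:
    """Expected number of distinct experts hit per layer by n_tokens tokens.

    Assumes uniform, independent top_k routing: a given expert is missed by
    one token with probability (1 - top_k / n_experts).
    """
    miss_all = (1.0 - top_k / n_experts) ** n_tokens
    return n_experts * (1.0 - miss_all)

N_EXPERTS = 128  # hypothetical, not taken from the model card
TOP_K = 8        # hypothetical, not taken from the model card

for n_tokens in (1, 4, 8):  # 8 = 7 drafted tokens + 1, per --spec-draft-n-max 7
    touched = expected_distinct_experts(N_EXPERTS, TOP_K, n_tokens)
    upper = min(n_tokens * TOP_K, N_EXPERTS)
    print(f"{n_tokens} tokens: ~{touched:.1f} experts touched (upper bound {upper})")
```

On a bandwidth-bound UMA system, the weight traffic of a verify step scales with the number of distinct experts read, so a verify batch can cost several times a single-token decode even though the active parameter count per token is unchanged.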

Questions

  • Is DFlash expected to work with quantized target models, or does it require BF16/F16 targets?
  • Is there any recommended way to improve DFlash acceptance rate on MoE targets?
  • Could the feature extraction / injection overhead be reduced in subsequent iterations?

Happy to provide additional logs, run tests with different quant levels, or test patches.
