DFlash performance regression on AMD APU + quantized MoE target: ~2x slower than baseline

# DFlash performance regression on AMD APU + quantized MoE target: ~2x slower than baseline (no speculation)

## Environment

| Component | Details |
|---|---|
| Platform | Windows 11 24H2, AMD Strix Halo (AI MAX+ 395 w/ Radeon 8060S) |
| GPU VRAM | 110,456 MiB (shared / UMA) |
| System RAM | 32 GB |
| ROCm SDK | 7.13.0 (tech preview, pip-installed rocm-sdk-devel==7.13.0) |
| Compiler | AMD Clang 23.0.0git (HIP 7.13) |
| llama.cpp | b9832 (f68a788b0), built with GGML_HIP=ON, GGML_RPC=ON |
| Backend | ROCm (HIP) on gfx1151 (RDNA 3.5) |

## Models

### Target model
- Source: unsloth/Qwen3.5-122B-A10B-GGUF
- Architecture: Qwen3.5 MoE (122B total / ~10B active per token)
- Quantization: Unsloth Dynamic UD-Q4_K_XL (per-layer adaptive quantization)
- Size: ~71.7 GB (3 split GGUF files)
- Multimodal projector: mmproj-F16.gguf (~0.85 GB)

### DFlash draft model
- Source: z-lab/Qwen3.5-122B-A10B-DFlash (HuggingFace, original weights are BF16)
- Architecture: DFlashDraftModel (diffusion-style block draft, 6 transformer layers)
- Conversion: converted locally via `convert_hf_to_gguf.py` with `--outtype f16` and `--target-model-dir Qwen/Qwen3.5-122B-A10B` (for tokenizer extraction)
- GGUF size: ~1.45 GB

### MTP draft model (for comparison)
The MTP variant of the same base model (from unsloth), run with `--spec-type draft-mtp`, reaches ~20 t/s on the same hardware.

## Reproduction

### Launch command (DFlash)
Running from the model directory with relative paths:

```powershell
cd $MODEL_DIR
llama-server.exe `
  -m "Qwen3.5-122B-A10B-UD-Q4_K_XL-00001-of-00003.gguf" `
  --mmproj "mmproj-F16.gguf" `
  --jinja -ngl 99 -b 4096 -ub 2048 -np 1 `
  --fit off --kv-unified `
  -md "Qwen3.5-122B-A10B-DFlash.gguf" `
  --spec-draft-n-max 7 --spec-type draft-dflash `
  --host 127.0.0.1 --port 8080 `
  -n -1 --temp 0.6 --top-k 20 --top-p 0.7 `
  --repeat-penalty 1 --min-p 0.05 `
  -to 3600 --metrics --props --prefill-assistant
```

### Launch command (Baseline, no speculation)
Same command without -md, --spec-draft-n-max, --spec-type.

### Launch command (MTP, for comparison)
Same command but with --spec-type draft-mtp (using the MTP variant of the same base model).

## Results

| Mode | Generation speed | Speedup vs baseline |
|---|---|---|
| No speculation (baseline) | 19.5 t/s | 1.00x |
| MTP (built-in MTP heads) | ~20 t/s | ~1.03x |
| DFlash (draft-dflash) | ~9.4 t/s | 0.48x |
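
The speedup column is each mode's throughput divided by the baseline throughput. A minimal sketch of that arithmetic, using the figures from the table above (the dictionary keys and the helper name are just labels chosen for this example):

```python
# Throughput figures (tokens/s) copied from the results table above.
throughput_tps = {
    "baseline": 19.5,
    "draft-mtp": 20.0,     # reported as "~20 t/s"
    "draft-dflash": 9.4,   # reported as "~9.4 t/s"
}


def speedup_vs_baseline(tps: dict[str, float], baseline_key: str = "baseline") -> dict[str, float]:
    """Return each mode's throughput relative to the baseline mode."""
    base = tps[baseline_key]
    return {mode: value / base for mode, value in tps.items()}


speedups = speedup_vs_baseline(throughput_tps)
for mode, ratio in speedups.items():
    print(f"{mode:>12}: {ratio:.2f}x")
```

Because the MTP and DFlash throughputs are approximate readings, the resulting ratios are approximate as well.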

Server logs (DFlash):
```
I common_speculative_impl_draft_dflash: adding speculative implementation 'draft-dflash'
I common_speculative_impl_draft_dflash: - n_max=7, n_min=0, p_min=0.00
I common_speculative_impl_draft_dflash: - block_size=16, mask_token_id=248077, n_extract=8
```

Metrics endpoint:
```
llamacpp:predicted_tokens_seconds 9.44
llamacpp:n_decode_total 7291
```

The slot also confirms speculative mode is active:
```json
"speculative": true,
"speculative.types": "none,draft-dflash"
```
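
For repeated runs it may be convenient to pull these counters out of the metrics text programmatically. The sketch below parses only the simple `name value` lines shown in the metrics excerpt earlier in this section; it does not handle labels or timestamps. `parse_metrics` is a hypothetical helper written for this example, and the token count of 100 is the approximate figure quoted in this issue, not a measured value.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse 'name value' lines of Prometheus-style text, skipping comments."""
    metrics: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics


# Sample copied from the metrics excerpt in this issue.
sample = """\
llamacpp:predicted_tokens_seconds 9.44
llamacpp:n_decode_total 7291
"""

metrics = parse_metrics(sample)

# Assumption: roughly 100 tokens were generated, as stated in this issue.
approx_generated_tokens = 100
decodes_per_token = metrics["llamacpp:n_decode_total"] / approx_generated_tokens
print(f"decode calls per generated token: {decodes_per_token:.1f}")
```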

## Observations

1. DFlash is roughly 2x slower than running without any speculation (~9.4 t/s vs 19.5 t/s). This is unexpected: the DFlash paper claims up to 4.21x speedup on Qwen3.5 models.

2. The `n_decode_total` counter shows 7291 decode calls for only ~100 generated tokens, i.e. roughly 73 decode calls per generated token. This is extremely high and suggests substantial overhead from DFlash's `process()` encoder/injection calls, which appear to run for every committed token.

3. The target model is quantized (UD-Q4_K_XL), but the DFlash draft model is F16. DFlash's encoder extracts intermediate layer embeddings from the target model. When the target is dynamically quantized (Unsloth Dynamic), the intermediate activation distributions may differ significantly from those of the BF16 version the draft model was trained with. This likely causes poor draft quality and a low acceptance rate, making the draft overhead a net loss.

4. The target model is MoE (122B total, 10B active). PR #22105 explicitly notes that MoE targets see smaller speedups because parallel verification activates more experts than single-token decoding, increasing computation per verify step. For gpt-oss-20b MoE, the PR reports DFlash speedup as low as 0.61x.

5. ROCm (HIP) is a newly supported backend for DFlash; the implementation was only just merged (June 28). HIP-specific performance tuning (e.g., CUDA graphs, warp-level optimizations) may not yet be applied.
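
To put the observed 0.48x in context, the sketch below uses the textbook speculative-decoding estimate in which each drafted token is accepted independently with a fixed probability. This is a simplifying assumption, not a description of how DFlash actually drafts or verifies. The acceptance probabilities in the loop are hypothetical, since no acceptance rate was measured here, and both function names are invented for this example. Given the observed speedup, the model yields the implied cost of one speculative step (draft, verify, and injection overhead combined) relative to one plain decode step.

```python
def expected_tokens_per_step(accept_prob: float, n_draft: int) -> float:
    """Expected tokens emitted per verification step under i.i.d. acceptance.

    Counts the accepted draft tokens plus the one token the target
    model always contributes at the end of a step.
    """
    if accept_prob >= 1.0:
        return float(n_draft + 1)
    return (1.0 - accept_prob ** (n_draft + 1)) / (1.0 - accept_prob)


def implied_step_cost(observed_speedup: float, accept_prob: float, n_draft: int) -> float:
    """Cost of one speculative step, in units of one plain decode step."""
    return expected_tokens_per_step(accept_prob, n_draft) / observed_speedup


OBSERVED_SPEEDUP = 0.48  # from the results table
N_DRAFT = 7              # --spec-draft-n-max 7

# Hypothetical acceptance probabilities, for illustration only.
for accept_prob in (0.0, 0.3, 0.6):
    cost = implied_step_cost(OBSERVED_SPEEDUP, accept_prob, N_DRAFT)
    print(f"accept_prob={accept_prob:.1f} -> implied step cost {cost:.2f}x a plain decode")
```

Even with zero acceptance, this model implies that each speculative step costs about 1/0.48 ≈ 2.08 plain decode steps, and any non-zero acceptance rate implies a higher per-step cost still. Under these assumptions, low acceptance alone would not explain the slowdown without substantial per-step overhead.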

## Potential causes

1. Quantization mismatch: the DFlash encoder expects BF16 target activations, but the target is UD-Q4_K_XL. If this corrupts the feature extraction, draft quality and acceptance rate would collapse, leaving only the drafting overhead with no gain.

2. MoE parallel-verification overhead: DFlash verifies a block of N tokens at once. For MoE, this means up to N*K experts activated simultaneously per MoE layer, versus K for single-token decode. On APU shared memory, this may cause bandwidth contention.

3. Feature injection overhead per token: DFlash's process() method extracts 8 layer embeddings from the target model and injects them into the draft model's KV cache for every new token. Each injection adds a llama_decode call, significantly inflating the total decode count.

4. HIP backend immaturity for DFlash: The draft model's small 6-layer decoder graph and non-causal attention pattern may not be well-optimized on ROCm yet.
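
The N*K figure in cause 2 is an upper bound. A rough estimate of how many distinct experts a verification batch touches per MoE layer can be sketched under the assumption that each token picks its top-K experts uniformly at random and independently of the other tokens. Real routers are not uniform, so this is only indicative. The expert count and top-K below are hypothetical placeholders, not the actual Qwen3.5-122B-A10B configuration, and the batch size of 8 assumes that 7 draft tokens plus one extra token are verified together.

```python
def expected_active_experts(n_experts: int, top_k: int, n_tokens: int) -> float:
    """Expected number of distinct experts hit in one MoE layer.

    Assumes uniform, independent routing: a given expert is missed by
    one token with probability (1 - top_k / n_experts).
    """
    p_unused = (1.0 - top_k / n_experts) ** n_tokens
    return n_experts * (1.0 - p_unused)


# Hypothetical MoE shape, for illustration only.
N_EXPERTS = 128
TOP_K = 8

single = expected_active_experts(N_EXPERTS, TOP_K, n_tokens=1)
batch = expected_active_experts(N_EXPERTS, TOP_K, n_tokens=8)
upper_bound = min(N_EXPERTS, 8 * TOP_K)

print(f"single-token decode : {single:.1f} experts per layer")
print(f"8-token verification: {batch:.1f} experts per layer (upper bound {upper_bound})")
print(f"expert weight traffic grows by about {batch / single:.1f}x")
```

Since decoding on a UMA system is largely bound by memory bandwidth, the ratio of expert weights read per step is the quantity of interest here, rather than the extra arithmetic.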

## Questions

- Is DFlash expected to work with quantized target models, or does it require BF16/F16 targets?
- Is there any recommended way to improve DFlash acceptance rate on MoE targets?
- Could the feature extraction / injection overhead be reduced in subsequent iterations?

Happy to provide additional logs, run tests with different quant levels, or test patches.


DFlash performance regression on AMD APU + quantized MoE target: ~2x slower than baseline #25117
