Questions
Is DFlash expected to work with quantized target models, or does it require BF16/F16 targets?
Is there any recommended way to improve DFlash acceptance rate on MoE targets?
Could the feature extraction / injection overhead be reduced in subsequent iterations?
Happy to provide additional logs, run tests with different quant levels, or test patches.
DFlash performance regression on AMD APU + quantized MoE target: ~2x slower than baseline (no speculation)
Environment
Models
Target model
DFlash draft model
MTP draft model (for comparison)
The MTP version (from Unsloth) of the same base model, run with --spec-type draft-mtp, achieves ~20 t/s on the same hardware.
Reproduction
Launch command (DFlash)
Running from the model directory with relative paths:
Launch command (Baseline, no speculation)
Same command without -md, --spec-draft-n-max, --spec-type.
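To keep the baseline and DFlash runs otherwise identical, the baseline argument list can be derived mechanically from the DFlash one. The following is a minimal sketch, assuming each of the three flags takes exactly one value as a separate argument; the binary name, file names, and flag values in the example are placeholders, not the actual command used here.

```python
from typing import Iterable, List

# Flags named in this report that enable speculation. Assumption: each one
# is followed by exactly one value passed as a separate argv element.
SPECULATION_FLAGS = ("-md", "--spec-draft-n-max", "--spec-type")


def strip_speculation_flags(
    argv: Iterable[str], flags_with_value: Iterable[str] = SPECULATION_FLAGS
) -> List[str]:
    """Return argv without the speculation flags and their values."""
    flags = set(flags_with_value)
    result: List[str] = []
    skip_next = False
    for arg in argv:
        if skip_next:
            # This element is the value of a flag that was just dropped.
            skip_next = False
            continue
        if arg in flags:
            skip_next = True
            continue
        result.append(arg)
    return result


# Hypothetical command line; every name and value is a placeholder.
dflash_argv = [
    "server-binary", "-m", "target.gguf",
    "-md", "draft.gguf",
    "--spec-draft-n-max", "8",
    "--spec-type", "SPEC_TYPE_PLACEHOLDER",
]
baseline_argv = strip_speculation_flags(dflash_argv)
```

Here `baseline_argv` keeps only the binary and the target model arguments, so any throughput difference between the two runs comes from speculation alone.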
Launch command (MTP, for comparison)
Same command but with --spec-type draft-mtp (using the MTP variant of the same base model).
Results
Server logs (DFlash):
Metrics endpoint:
The slot also confirms speculative mode is active:
Observations
DFlash is about 2x slower than running without any speculation. This is unexpected: the DFlash paper claims up to 4.21x speedup on Qwen3.5 models.
The n_decode_total counter shows ~7291 decode calls for only ~100 generated tokens, roughly 73 decode calls per generated token. This is extremely high and suggests massive overhead from DFlash's process() encoder/injection calls, which run for every committed token.
The base model is quantized (UD-Q4_K_XL) but the DFlash draft model is F16. DFlash's encoder extracts intermediate layer embeddings from the target model. When the target is dynamically quantized (Unsloth Dynamic), the intermediate activation distributions may differ significantly from the BF16 version the draft model was trained with. This likely causes poor draft quality and low acceptance rate, making the draft overhead a net loss.
The target model is MoE (122B total, 10B active). PR #22105 ([Speculative decoding] feat: add DFlash support) explicitly notes that MoE targets see smaller speedups because parallel verification activates more experts than single-token decoding, increasing computation per verify step. For gpt-oss-20b MoE, the PR reports DFlash speedup as low as 0.61x.
ROCm (HIP) is a newly supported backend for DFlash. The implementation was just merged (June 28). HIP-specific performance tuning may not yet be applied (e.g., the HIP equivalents of CUDA graphs and warp-level optimizations).
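The observations above can be put into a rough cost model. This is a sketch under simplifying assumptions, not a description of how DFlash or llama.cpp accounts for cost: it assumes each drafted token is accepted independently with probability `alpha`, and that one draft-plus-verify step costs `step_cost` baseline single-token decodes. The function names and example values are illustrative.

```python
def decode_calls_per_token(n_decode_total: int, n_generated: int) -> float:
    """Average number of decode calls issued per generated token."""
    return n_decode_total / n_generated


def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens committed per verify step.

    alpha: per-token acceptance probability (assumed i.i.d.).
    gamma: number of drafted tokens per step.
    Uses the geometric-series form (1 - alpha^(gamma+1)) / (1 - alpha),
    which counts the one token the target always contributes.
    """
    if alpha >= 1.0:
        return gamma + 1.0
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, gamma: int, step_cost: float) -> float:
    """Speedup over plain decoding.

    step_cost: cost of one draft+verify step, in units of one baseline
    single-token decode (an assumed input, not a measured value).
    """
    return expected_tokens_per_step(alpha, gamma) / step_cost


# Counter values reported in this issue (both approximate).
calls_per_token = decode_calls_per_token(7291, 100)

# Illustration only: with near-zero acceptance, each step commits one token,
# so a step costing two baseline decodes yields a 0.5x "speedup".
worst_case = estimated_speedup(alpha=0.0, gamma=8, step_cost=2.0)
```

Under this model a 2x slowdown is what near-zero acceptance would produce if each speculative step cost about twice a plain decode, so measuring the actual acceptance rate and per-step cost would help separate the candidate causes.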
Potential causes
Quantization mismatch: the DFlash encoder expects BF16 target activations, but the target is UD-Q4_K_XL. If the quantized activations differ enough, feature extraction degrades, draft quality drops, and acceptance falls toward zero, leaving all of the drafting overhead and none of the gain.
MoE parallel-verification overhead: DFlash verifies a block of N tokens at once. For MoE, this means up to N*K experts activated per MoE layer (capped at the total number of experts) vs K for single-token decode. On an APU, where CPU and GPU share memory, this may cause bandwidth contention.
Feature injection overhead per token: DFlash's process() method extracts 8 layer embeddings from the target model and injects them into the draft model's KV cache for every new token. Each injection adds a llama_decode call, significantly inflating the total decode count.
HIP backend immaturity for DFlash: The draft model's small 6-layer decoder graph and non-causal attention pattern may not be well-optimized on ROCm yet.
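To get a feel for the MoE cause, the number of distinct experts touched by a verify batch can be estimated. This is a minimal sketch that assumes routing is uniform and independent across tokens, which real routers are not, so it is a rough guide only. The expert counts in the example are hypothetical and are not the configuration of the target model in this report.

```python
def max_distinct_experts(num_experts: int, top_k: int, n_tokens: int) -> int:
    """Upper bound per MoE layer: N*K, capped at the total expert count."""
    return min(num_experts, top_k * n_tokens)


def expected_distinct_experts(num_experts: int, top_k: int, n_tokens: int) -> float:
    """Expected distinct experts activated per layer by n_tokens tokens.

    Assumption: each token selects a uniformly random set of top_k experts,
    independently of the other tokens. A given expert is then missed by one
    token with probability (1 - top_k / num_experts).
    """
    p_unused = (1.0 - top_k / num_experts) ** n_tokens
    return num_experts * (1.0 - p_unused)


# Hypothetical configuration: 128 experts per layer, 8 active per token.
single_token = expected_distinct_experts(128, 8, 1)
verify_block = expected_distinct_experts(128, 8, 16)
upper_bound = max_distinct_experts(128, 8, 16)
```

Because expert weights have to be read from memory once per activated expert, the ratio `verify_block / single_token` gives a crude estimate of how much more weight traffic a verify step generates than a single-token decode on a bandwidth-bound device.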