Paper (Coming Soon) | Blog | Models
DFlash is a lightweight block diffusion model designed for speculative decoding. It enables efficient and high-quality parallel drafting.
Demo video: DFlash_demo.mp4
- Qwen3-4B: https://huggingface.co/z-lab/Qwen3-4B-DFlash-b16
- Qwen3-8B: https://huggingface.co/z-lab/Qwen3-8B-DFlash-b16
- Qwen3-Coder-30B-A3B: https://huggingface.co/z-lab/Qwen3-Coder-30B-A3B-DFlash
- meta-llama/Llama-3.1-8B-Instruct
- openai/gpt-oss-120b
- zai-org/GLM-4.7
- zai-org/GLM-4.7-Flash
💡 Feel free to open a GitHub issue if you’d like to request support for additional models!
We will also open-source the training recipe soon, so you can train your own DFlash draft model to accelerate any LLM.
DFlash is now supported on SGLang, enabling high-throughput speculative decoding in a production-grade serving stack. vLLM integration is currently in progress.
uv pip install "git+https://github.com/sgl-project/sglang.git@refs/pull/16818/head#subdirectory=python"python -m sglang.launch_server \
--model-path Qwen/Qwen3-Coder-30B-A3B-Instruct \
--speculative-algorithm DFLASH \
--speculative-draft-model-path z-lab/Qwen3-Coder-30B-A3B-DFlash \
--tp-size 1 \
--dtype bfloat16 \
--attention-backend fa3 \
--mem-fraction-static 0.75 \
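Once the server is up, you can query it through SGLang's OpenAI-compatible API. A minimal client sketch, assuming the server above is reachable on the default port 30000 and that the prompt and generation settings are only illustrative:

```python
from openai import OpenAI

# The SGLang server exposes an OpenAI-compatible endpoint;
# no real API key is needed for a local deployment.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-Coder-30B-A3B-Instruct",
    messages=[{"role": "user", "content": "Write a Python function that checks whether a number is prime."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```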
To run DFlash locally with Hugging Face Transformers, set up the environment and install the dependencies:

```bash
conda create -n dflash python=3.11
conda activate dflash
git clone https://github.com/z-lab/dflash.git
cd dflash
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
```

The following example demonstrates how to load the DFlash drafter and the Qwen3-8B target model to perform speculative decoding.

```python
import torch
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer
# 1. Load the DFlash Draft Model
# Note: trust_remote_code=True is required for the custom diffusion architecture.
# We currently recommend running on a single GPU.
model = AutoModel.from_pretrained(
    "z-lab/Qwen3-8B-DFlash-b16",
    trust_remote_code=True,
    dtype="auto",
    device_map="cuda:0",
).eval()
# 2. Load the Target Model
target = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B",
    dtype="auto",
    device_map="cuda:0",
).eval()
# 3. Load Tokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
# Essential: Add the mask token required for diffusion steps
tokenizer.add_special_tokens({"mask_token": "<|MASK|>"})
# 4. Prepare Input
prompt = "How many positive whole-number divisors does 196 have?"
messages = [
{"role": "user", "content": prompt}
]
# Note: this draft model is intended for use with thinking mode disabled
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
# 5. Run Speculative Decoding
# The 'spec_generate' function is a custom method provided by the DFlash model
generate_ids = model.spec_generate(
    input_ids=model_inputs["input_ids"],
    max_new_tokens=2048,
    temperature=0.0,
    target=target,
    mask_token_id=tokenizer.mask_token_id,
    stop_token_ids=[tokenizer.eos_token_id]
)
print(tokenizer.decode(generate_ids[0], skip_special_tokens=True))
```

We provide scripts to reproduce our speedup and acceptance-length metrics. The reported results were measured on NVIDIA B200 GPUs.
To run the benchmark:
```bash
bash run_benchmark.sh
```
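As a quick sanity check outside the provided scripts, you can also compare the wall-clock latency of speculative decoding against the target model decoding alone. A minimal sketch that reuses `model`, `target`, `tokenizer`, and `model_inputs` from the example above; the generation length (512 here) is illustrative, and the measured ratio is only a rough estimate, not the benchmarked speedup:

```python
import time

import torch

# Rough wall-clock comparison; the two calls may emit slightly different
# numbers of tokens, so treat the resulting ratio as an estimate.
def timed(fn):
    torch.cuda.synchronize()
    start = time.perf_counter()
    out = fn()
    torch.cuda.synchronize()
    return out, time.perf_counter() - start

# Baseline: the target model decoding on its own (greedy).
_, baseline_s = timed(lambda: target.generate(
    **model_inputs, max_new_tokens=512, do_sample=False
))

# Speculative decoding with the DFlash drafter.
_, spec_s = timed(lambda: model.spec_generate(
    input_ids=model_inputs["input_ids"],
    max_new_tokens=512,
    temperature=0.0,
    target=target,
    mask_token_id=tokenizer.mask_token_id,
    stop_token_ids=[tokenizer.eos_token_id],
))

print(f"baseline: {baseline_s:.2f}s  speculative: {spec_s:.2f}s  "
      f"speedup: {baseline_s / spec_s:.2f}x")
```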
If you find DFlash useful for your research or applications, please cite our project. The full paper is coming soon!

```bibtex
@article{chen2026dflash,
  title   = {DFlash: Block Diffusion for Flash Speculative Decoding},
  author  = {Chen, Jian and Liu, Zhijian},
  journal = {arXiv preprint},
  year    = {2026},
  url     = {https://github.com/z-lab/dflash},
  note    = {Paper coming soon}
}
```
