
DFlash: Block Diffusion for Flash Speculative Decoding

Paper (Coming Soon) | Blog | Models

DFlash is a lightweight block diffusion model designed for speculative decoding. It enables efficient and high-quality parallel drafting.

[Figure: DFlash architecture]
[Demo video: DFlash_demo.mp4]

📦 Model Support Plan

✅ Supported

  • Qwen/Qwen3-8B (draft model: z-lab/Qwen3-8B-DFlash-b16)
  • Qwen/Qwen3-Coder-30B-A3B-Instruct (draft model: z-lab/Qwen3-Coder-30B-A3B-DFlash)

🚧 Coming Soon

  • meta-llama/Llama-3.1-8B-Instruct
  • openai/gpt-oss-120b
  • zai-org/GLM-4.7
  • zai-org/GLM-4.7-Flash
💡 Feel free to open a GitHub issue if you’d like to request support for additional models!
We will also open-source the training recipe soon, so you can train your own DFlash draft model to accelerate any LLM.


🚀 Quick Start

Serving with SGLang

DFlash is now supported on SGLang, enabling high-throughput speculative decoding in a production-grade serving stack. vLLM integration is currently in progress.

Installation

uv pip install "git+https://github.com/sgl-project/sglang.git@refs/pull/16818/head#subdirectory=python"

Serving

python -m sglang.launch_server \
    --model-path Qwen/Qwen3-Coder-30B-A3B-Instruct \
    --speculative-algorithm DFLASH \
    --speculative-draft-model-path z-lab/Qwen3-Coder-30B-A3B-DFlash \
    --tp-size 1 \
    --dtype bfloat16 \
    --attention-backend fa3 \
    --mem-fraction-static 0.75 \
    --trust-remote-code
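
Once the server is running, speculative decoding is transparent to clients: you can query it through SGLang's OpenAI-compatible API like any other deployment. A minimal sketch, assuming the default port 30000 (adjust if you pass --port):

from openai import OpenAI

# Point an OpenAI-compatible client at the local SGLang server.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-Coder-30B-A3B-Instruct",
    messages=[{"role": "user", "content": "Write a function that checks whether a number is prime."}],
    temperature=0,
    max_tokens=512,
)
print(response.choices[0].message.content)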

Transformers

Installation

conda create -n dflash python=3.11
conda activate dflash

git clone https://github.com/z-lab/dflash.git
cd dflash

pip install -r requirements.txt
pip install flash-attn --no-build-isolation

Example Usage

The following example demonstrates how to load the DFlash drafter and the Qwen3-8B target model to perform speculative decoding.

import torch
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

# 1. Load the DFlash Draft Model
# Note: trust_remote_code=True is required for the custom diffusion architecture.
# We currently recommend running on a single GPU.
model = AutoModel.from_pretrained(
    "z-lab/Qwen3-8B-DFlash-b16", 
    trust_remote_code=True, 
    dtype="auto", 
    device_map="cuda:0"
).eval()

# 2. Load the Target Model
target = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B", 
    dtype="auto", 
    device_map="cuda:0"
).eval()

# 3. Load Tokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
# Essential: Add the mask token required for diffusion steps
tokenizer.add_special_tokens({"mask_token": "<|MASK|>"})

# 4. Prepare Input
prompt = "How many positive whole-number divisors does 196 have?"
messages = [
    {"role": "user", "content": prompt}
]
# Note: this draft model is intended for use with thinking mode disabled
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# 5. Run Speculative Decoding
# The 'spec_generate' function is a custom method provided by the DFlash model
generate_ids = model.spec_generate(
    input_ids=model_inputs["input_ids"], 
    max_new_tokens=2048, 
    temperature=0.0, 
    target=target, 
    mask_token_id=tokenizer.mask_token_id, 
    stop_token_ids=[tokenizer.eos_token_id]
)

print(tokenizer.decode(generate_ids[0], skip_special_tokens=True))
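
To get a rough sense of the speedup on your hardware, you can time the spec_generate call above and compare it against plain autoregressive decoding with the target model alone. A minimal sketch using the standard generate API, reusing target and model_inputs from above (greedy decoding matches temperature=0.0):

import time
import torch

start = time.time()
# Plain autoregressive decoding with the target model only (no drafter).
baseline_ids = target.generate(
    **model_inputs,
    max_new_tokens=2048,
    do_sample=False,
)
torch.cuda.synchronize()
print(f"Baseline autoregressive decoding: {time.time() - start:.1f} s")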

📊 Evaluation & Benchmarks

We provide scripts to reproduce our speedup and acceptance length metrics. The reported results were measured on NVIDIA B200 GPUs.

To run the benchmark:

bash run_benchmark.sh
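
Acceptance length is (roughly) the average number of tokens committed per target-model verification step, and speedup grows with it. A minimal sketch of the bookkeeping, with hypothetical per-step counts for illustration only (independent of run_benchmark.sh):

# Number of tokens committed at each target verification step
# (hypothetical values for illustration only).
accepted_per_step = [5, 7, 3, 8, 6]

acceptance_length = sum(accepted_per_step) / len(accepted_per_step)
print(f"Mean acceptance length: {acceptance_length:.2f} tokens per target step")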

Citation

If you find DFlash useful for your research or applications, please cite our project. The full paper is coming soon!

@article{chen2026dflash,
  title   = {DFlash: Block Diffusion for Flash Speculative Decoding},
  author  = {Chen, Jian and Liu, Zhijian},
  journal = {arXiv preprint},
  year    = {2026},
  url     = {https://github.com/z-lab/dflash},
  note    = {Paper coming soon}
}
