Kaihua Liang1, Xin Tan2, An Zhong1, Hong Xu2, Marco Canini1
1King Abdullah University of Science and Technology 2The Chinese University of Hong Kong
FOCUS is an inference system for diffusion LLMs (DLLMs) built on top of the LMDeploy engine. It targets a key compute-bound bottleneck in block-diffusion decoding: each step computes over a full token block, yet only a small fraction of those tokens is actually decodable.

Using attention-derived token importance deltas from early layers, this training-free approach predicts which tokens are likely decodable and evicts the rest on the fly, avoiding redundant computation, increasing the effective batch size, and enabling scalable throughput. FOCUS achieves up to a 3.52× throughput improvement without compromising quality across benchmarks. This repo contains the LMDeploy-based implementation for SDAR and LLaDA2.0-mini.
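The eviction idea can be illustrated with a toy NumPy sketch (the function name, shapes, and scoring rule here are illustrative only, not the actual Triton kernels in `focus.py`): score each block token by the attention mass it receives, keep the highest-scoring tokens, and compact the remaining hidden states.

```python
import numpy as np

def focus_evict(hidden, attn_weights, keep_ratio=0.5):
    """Toy sketch of importance-based token eviction.

    hidden:       [block_len, d] hidden states for one token block.
    attn_weights: [num_queries, block_len] attention weights from an
                  early layer; summing over queries gives a per-token
                  importance score.
    """
    importance = attn_weights.sum(axis=0)
    k = max(1, int(round(keep_ratio * hidden.shape[0])))
    # Keep the k most important tokens, preserving their original order.
    keep = np.sort(np.argsort(importance)[-k:])
    return hidden[keep], keep

rng = np.random.default_rng(0)
hidden = rng.standard_normal((8, 4))   # 8 block tokens, hidden dim 4
attn = rng.random((3, 8))              # 3 query tokens attend over the block
compact, kept = focus_evict(hidden, attn, keep_ratio=0.25)
```

In the real system the compacted batch is what flows through the remaining layers, which is where the effective-batch-size gain comes from.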
Based on LMDeploy, the main FOCUS-related implementations are in:
- `lmdeploy/pytorch/kernels/cuda/focus.py`: Triton kernels for importance scoring, target selection, and state compaction.
- `lmdeploy/pytorch/kernels/cuda/pagedattention.py`: attention kernels (including ragged paged attention).
- `lmdeploy/pytorch/kernels/cuda/fill_kv_cache.py`: KV-cache fill kernels (including sparse KV fill for paged attention).
- `lmdeploy/pytorch/models/sdar.py`: FOCUS eviction integrated into SDAR layers.
- `lmdeploy/pytorch/models/llada2.py`: FOCUS eviction integrated into LLaDA2.0-mini layers.
- `lmdeploy/pytorch/strategies/dllm/sequence.py`: `FocusState` tracking and per-step statistics.
- `lmdeploy/pytorch/strategies/dllm/model_inputs.py`: FOCUS-specific inputs for DLLM batches.
- `lmdeploy/pytorch/model_inputs.py`: FOCUS runtime view and host/device synchronization.
- `lmdeploy/pytorch/engine/inputs_maker.py`: builds FOCUS masks and pinned buffers for delayed-cache batches.
- `lmdeploy/pytorch/engine/model_agent.py`: propagates processed positions back to the scheduler.
FOCUS relies on Triton CUDA kernels and is intended for CUDA GPUs. LMDeploy's default prebuilt wheels target CUDA 12 (since v0.3.0); RTX 50-series GPUs require CUDA 12.8 wheels. CUDA 11+ is supported when building from source, but ensure your local CUDA toolkit matches your PyTorch/Triton stack. We have only tested the FOCUS flow with Python 3.13; Python 3.9-3.12 should also work but is not covered by our tests.
- Create and activate a Python environment.
- Install runtime dependencies:

  ```shell
  pip install -r requirements/runtime_cuda.txt
  ```

- Install the repo using the PyTorch engine:

  ```shell
  DISABLE_TURBOMIND=1 pip install -e .
  ```

All scripts write logs to `./results`. Run them from the repo root.
- FOCUS throughput: `benchmark/run_focus_throughput_evaluation.sh`

  ```shell
  benchmark/run_focus_throughput_evaluation.sh <dataset_id> <model_id> [alpha]
  ```

  Example:

  ```shell
  benchmark/run_focus_throughput_evaluation.sh anon8231489123/ShareGPT_Vicuna_unfiltered JetLM/SDAR-8B-Chat-b32
  benchmark/run_focus_throughput_evaluation.sh anon8231489123/ShareGPT_Vicuna_unfiltered JetLM/SDAR-8B-Chat-b32 1.8
  ```
- LMDeploy throughput: `benchmark/run_baseline_throughput_evaluation.sh`

  ```shell
  benchmark/run_baseline_throughput_evaluation.sh <dataset_id> <model_id>
  ```

  Example:

  ```shell
  benchmark/run_baseline_throughput_evaluation.sh anon8231489123/ShareGPT_Vicuna_unfiltered JetLM/SDAR-8B-Chat-b32
  ```
- Block size comparison for SDAR: `benchmark/run_block_size_comparison.sh`

  ```shell
  benchmark/run_block_size_comparison.sh <dataset_id>
  ```

  This runs the SDAR models `JetLM/SDAR-8B-Chat-b16` and `JetLM/SDAR-8B-Chat-b64` for both FOCUS and Base settings.

- Delayed cache baseline for SDAR: `benchmark/run_sdar_delayed_cache_benchmark.sh`

  ```shell
  benchmark/run_sdar_delayed_cache_benchmark.sh <dataset_id>
  ```

  This runs `JetLM/SDAR-8B-Chat-b32` with delayed cache enabled and FOCUS disabled.
Dataset notes:

- `dataset_id` can be a HuggingFace dataset ID or a local JSON/JSONL path supported by `benchmark/profile_throughput.py`.
- HuggingFace dataset IDs require the `datasets` package and network access.
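For local datasets, a minimal file can be produced as in the sketch below. The ShareGPT-style `conversations` schema shown here is an assumption; check `benchmark/profile_throughput.py` for the exact fields it parses.

```python
import json

# Hypothetical minimal local dataset; the field names follow the common
# ShareGPT layout and may need adjusting to match profile_throughput.py.
samples = [
    {"conversations": [
        {"from": "human", "value": "Summarize the plot of Hamlet in two sentences."},
        {"from": "gpt", "value": "Prince Hamlet seeks revenge for his father's murder."},
    ]}
]
with open("local_dataset.json", "w") as f:
    json.dump(samples, f, indent=2)
```

The resulting path can then be passed to the benchmark scripts in place of a HuggingFace dataset ID.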
For generation-quality evaluation of SDAR/LLaDA2.0-mini models, see `opencompass-0.5.1.post1/README.md` for OpenCompass benchmarking instructions with either the HuggingFace/Transformers or LMDeploy backend.
If you find FOCUS useful in your work, please cite:

```bibtex
@article{liang2026focus,
  title   = {FOCUS: DLLMs Know How to Tame Their Compute Bound},
  author  = {Kaihua Liang and Xin Tan and An Zhong and Hong Xu and Marco Canini},
  journal = {arXiv preprint arXiv:2601.23278},
  year    = {2026},
  url     = {https://arxiv.org/abs/2601.23278}
}
```

FOCUS builds on and/or includes code and models from:
- LMDeploy
- OpenCompass (vendored snapshot under `opencompass-0.5.1.post1/`)
- SDAR: JetAstra/SDAR | weights: JetLM/SDAR-8B-Chat-b16, JetLM/SDAR-8B-Chat-b32, JetLM/SDAR-8B-Chat-b64
- LLaDA2.0: inclusionAI/LLaDA2.0 | weights: inclusionAI/LLaDA2.0-mini

