Kaihua Liang1, Xin Tan2, An Zhong1, Hong Xu2, Marco Canini1
1King Abdullah University of Science and Technology 2The Chinese University of Hong Kong
FOCUS is an inference system for diffusion LLMs (DLLMs) built on top of the LMDeploy engine. It targets a key compute-bound bottleneck in block-diffusion decoding: each step computes over a full token block, yet only a small fraction of those tokens is actually decodable.

Using attention-derived token importance deltas from early layers, this training-free approach predicts which tokens are likely decodable and evicts the rest on the fly, avoiding redundant computation, increasing the effective batch size, and enabling scalable throughput. FOCUS achieves up to a 3.52× throughput improvement without compromising quality across benchmarks. This repo contains the LMDeploy-based implementation for SDAR and LLaDA2.0-mini.
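The eviction idea can be illustrated with a toy NumPy sketch (the function name, shapes, and scoring rule here are illustrative only, not the actual Triton kernels in `focus.py`): score each block token by the attention mass it receives, keep the highest-scoring tokens, and compact the remaining hidden states.

```python
import numpy as np

def focus_evict(hidden, attn_weights, keep_ratio=0.5):
    """Toy sketch of importance-based token eviction.

    hidden:       [block_len, d] hidden states for one token block.
    attn_weights: [num_queries, block_len] attention weights from an
                  early layer; summing over queries gives a per-token
                  importance score.
    """
    importance = attn_weights.sum(axis=0)
    k = max(1, int(round(keep_ratio * hidden.shape[0])))
    # Keep the k most important tokens, preserving their original order.
    keep = np.sort(np.argsort(importance)[-k:])
    return hidden[keep], keep

rng = np.random.default_rng(0)
hidden = rng.standard_normal((8, 4))   # 8 block tokens, hidden dim 4
attn = rng.random((3, 8))              # 3 query tokens attend over the block
compact, kept = focus_evict(hidden, attn, keep_ratio=0.25)
```

In the real system the compacted batch is what flows through the remaining layers, which is where the effective-batch-size gain comes from.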
Based on LMDeploy, the main FOCUS-related implementations are in:
- `lmdeploy/pytorch/kernels/cuda/focus.py`: Triton kernels for importance scoring, target selection, and state compaction.
- `lmdeploy/pytorch/kernels/cuda/pagedattention.py`: attention kernels (including ragged paged attention).
- `lmdeploy/pytorch/kernels/cuda/fill_kv_cache.py`: KV-cache fill kernels (including sparse KV fill for paged attention).
- `lmdeploy/pytorch/models/sdar.py`: FOCUS eviction integrated into SDAR layers.
- `lmdeploy/pytorch/models/llada2.py`: FOCUS eviction integrated into LLaDA2.0-mini layers.
- `lmdeploy/pytorch/strategies/dllm/sequence.py`: `FocusState` tracking and per-step statistics.
- `lmdeploy/pytorch/strategies/dllm/model_inputs.py`: FOCUS-specific inputs for DLLM batches.
- `lmdeploy/pytorch/model_inputs.py`: FOCUS runtime view and host/device synchronization.
- `lmdeploy/pytorch/engine/inputs_maker.py`: builds FOCUS masks and pinned buffers for delayed-cache batches.
- `lmdeploy/pytorch/engine/model_agent.py`: propagates processed positions back to the scheduler.
FOCUS relies on Triton CUDA kernels and is intended for CUDA GPUs. LMDeploy's default prebuilt wheels target CUDA 12 (since v0.3.0); RTX 50-series GPUs require CUDA 12.8 wheels. CUDA 11+ is supported when building from source, but ensure your local CUDA toolkit matches your PyTorch/Triton stack. We have only tested the FOCUS flow with Python 3.13; Python 3.9-3.12 should also work but is not covered by our tests.
- Create and activate a Python environment.
- Install runtime dependencies:

  ```shell
  pip install -r requirements/runtime_cuda.txt
  ```

- Install the repo using the PyTorch engine:

  ```shell
  DISABLE_TURBOMIND=1 pip install -e .
  ```

All scripts write logs to `./results`. Run them from the repo root.
- FOCUS throughput: `benchmark/run_focus_throughput_evaluation.sh`

  ```shell
  benchmark/run_focus_throughput_evaluation.sh <dataset_id> <model_id> [alpha]
  ```

  Example:

  ```shell
  benchmark/run_focus_throughput_evaluation.sh anon8231489123/ShareGPT_Vicuna_unfiltered JetLM/SDAR-8B-Chat-b32
  benchmark/run_focus_throughput_evaluation.sh anon8231489123/ShareGPT_Vicuna_unfiltered JetLM/SDAR-8B-Chat-b32 1.8
  ```
- LMDeploy throughput: `benchmark/run_baseline_throughput_evaluation.sh`

  ```shell
  benchmark/run_baseline_throughput_evaluation.sh <dataset_id> <model_id>
  ```

  Example:

  ```shell
  benchmark/run_baseline_throughput_evaluation.sh anon8231489123/ShareGPT_Vicuna_unfiltered JetLM/SDAR-8B-Chat-b32
  ```
- Block size comparison for SDAR: `benchmark/run_block_size_comparison.sh`

  ```shell
  benchmark/run_block_size_comparison.sh <dataset_id>
  ```

  This runs the SDAR models `JetLM/SDAR-8B-Chat-b16` and `JetLM/SDAR-8B-Chat-b64` for both FOCUS and Base settings.

- Delayed cache baseline for SDAR: `benchmark/run_sdar_delayed_cache_benchmark.sh`

  ```shell
  benchmark/run_sdar_delayed_cache_benchmark.sh <dataset_id>
  ```

  This runs `JetLM/SDAR-8B-Chat-b32` with delayed cache enabled and FOCUS disabled.
Dataset notes:

- `dataset_id` can be a HuggingFace dataset ID or a local JSON/JSONL path supported by `benchmark/profile_throughput.py`.
- HuggingFace dataset IDs require the `datasets` package and network access.
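For local datasets, a minimal file can be produced as in the sketch below. The ShareGPT-style `conversations` schema shown here is an assumption; check `benchmark/profile_throughput.py` for the exact fields it parses.

```python
import json

# Hypothetical minimal local dataset; the field names follow the common
# ShareGPT layout and may need adjusting to match profile_throughput.py.
samples = [
    {"conversations": [
        {"from": "human", "value": "Summarize the plot of Hamlet in two sentences."},
        {"from": "gpt", "value": "Prince Hamlet seeks revenge for his father's murder."},
    ]}
]
with open("local_dataset.json", "w") as f:
    json.dump(samples, f, indent=2)
```

The resulting path can then be passed to the benchmark scripts in place of a HuggingFace dataset ID.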
For generation-quality evaluation of SDAR/LLaDA2.0-mini models, see `opencompass-0.5.1.post1/README.md` for OpenCompass benchmarking instructions with either the HuggingFace/Transformers or LMDeploy backend.
If you find FOCUS useful in your work, please cite:

```bibtex
@article{liang2026focus,
  title   = {FOCUS: DLLMs Know How to Tame Their Compute Bound},
  author  = {Kaihua Liang and Xin Tan and An Zhong and Hong Xu and Marco Canini},
  journal = {arXiv preprint arXiv:2601.23278},
  year    = {2026},
  url     = {https://arxiv.org/abs/2601.23278}
}
```

FOCUS builds on and/or includes code and models from:
- LMDeploy
- OpenCompass (vendored snapshot under `opencompass-0.5.1.post1/`)
- SDAR: JetAstra/SDAR | weights: JetLM/SDAR-8B-Chat-b16, JetLM/SDAR-8B-Chat-b32, JetLM/SDAR-8B-Chat-b64
- LLaDA2.0: inclusionAI/LLaDA2.0 | weights: inclusionAI/LLaDA2.0-mini

