sands-lab/FOCUS


FOCUS: DLLMs Know How to Tame Their Compute Bound

Kaihua Liang¹, Xin Tan², An Zhong¹, Hong Xu², Marco Canini¹

¹King Abdullah University of Science and Technology    ²The Chinese University of Hong Kong

Paper: https://arxiv.org/abs/2601.23278

Design Overview

FOCUS architecture overview

FOCUS is an inference system for diffusion LLMs (DLLMs) built on top of the LMDeploy engine. It targets a key compute-bound bottleneck in block-diffusion decoding: models compute over a full token block each step, yet only a small fraction of tokens are actually decodable.

Using attention-derived token-importance deltas from early layers, this training-free approach predicts which tokens are likely to be decodable and evicts non-decodable ones on the fly. This avoids redundant computation, increases the effective batch size, and enables scalable throughput. FOCUS achieves up to 3.52× throughput improvement without compromising quality across benchmarks. This repo contains the LMDeploy-based implementation for SDAR and LLaDA2.0-mini.
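The prediction-and-eviction idea can be sketched as follows. This is an illustrative reconstruction, not the repo's actual kernels or API: the names `token_importance_delta` and `evict_mask` and the `keep_ratio` threshold are invented for this sketch, and the real system operates on attention scores inside the engine rather than on NumPy arrays.

```python
import numpy as np

def token_importance_delta(attn_prev, attn_curr):
    # Per-token importance: attention mass each key token receives,
    # averaged over early layers and heads.
    # attn_*: arrays of shape (layers, heads, queries, keys).
    prev = attn_prev.mean(axis=(0, 1)).sum(axis=0)  # shape: (keys,)
    curr = attn_curr.mean(axis=(0, 1)).sum(axis=0)
    # The change between consecutive decoding steps signals which
    # tokens are becoming decodable.
    return curr - prev

def evict_mask(delta, keep_ratio=0.25):
    # Keep only the tokens whose importance is rising fastest; evict
    # the rest from this step's compute so the freed capacity can be
    # spent on a larger effective batch.
    k = max(1, int(round(len(delta) * keep_ratio)))
    keep_idx = np.argsort(delta)[-k:]
    mask = np.zeros(len(delta), dtype=bool)
    mask[keep_idx] = True
    return mask  # True = keep this step, False = evict
```

In the actual system the kept tokens continue through the remaining layers while evicted ones are skipped for that step, which is where the compute savings come from.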

Efficiency Improvement

Efficiency improvement

Key Implementation Files (FOCUS)

Based on LMDeploy, the main FOCUS-related implementations are in:

Install (CUDA)

FOCUS relies on Triton CUDA kernels and is intended for CUDA GPUs. LMDeploy's default prebuilt wheels target CUDA 12 (since v0.3.0); RTX 50-series GPUs require CUDA 12.8 wheels. CUDA 11+ is supported when building from source, but make sure your local CUDA toolkit matches your PyTorch/Triton stack. We have only tested the FOCUS flow with Python 3.13; other Python versions (3.9-3.12) should also work but are not covered by our tests.

  1. Create and activate a Python environment.
  2. Install runtime dependencies:
     pip install -r requirements/runtime_cuda.txt
  3. Install the repo using the PyTorch engine:
     DISABLE_TURBOMIND=1 pip install -e .
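After installing, an optional one-line sanity check can confirm the environment sees a CUDA-capable PyTorch build (this assumes torch was pulled in as a dependency of the PyTorch engine; adjust for your stack):

```shell
# Print the torch version, the CUDA version it was built against,
# and whether a CUDA device is visible.
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
```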

Benchmarking

All scripts write logs to ./results. Run them from the repo root.

  • FOCUS throughput: benchmark/run_focus_throughput_evaluation.sh

    benchmark/run_focus_throughput_evaluation.sh <dataset_id> <model_id> [alpha]

    Example:

    benchmark/run_focus_throughput_evaluation.sh anon8231489123/ShareGPT_Vicuna_unfiltered JetLM/SDAR-8B-Chat-b32
    benchmark/run_focus_throughput_evaluation.sh anon8231489123/ShareGPT_Vicuna_unfiltered JetLM/SDAR-8B-Chat-b32 1.8
  • LMDeploy throughput: benchmark/run_baseline_throughput_evaluation.sh

    benchmark/run_baseline_throughput_evaluation.sh <dataset_id> <model_id>

    Example:

    benchmark/run_baseline_throughput_evaluation.sh anon8231489123/ShareGPT_Vicuna_unfiltered JetLM/SDAR-8B-Chat-b32
  • Block size comparison for SDAR: benchmark/run_block_size_comparison.sh

    benchmark/run_block_size_comparison.sh <dataset_id>

    This runs SDAR models JetLM/SDAR-8B-Chat-b16 and JetLM/SDAR-8B-Chat-b64 for both FOCUS and Base settings.

  • Delayed cache baseline for SDAR: benchmark/run_sdar_delayed_cache_benchmark.sh

    benchmark/run_sdar_delayed_cache_benchmark.sh <dataset_id>

    This runs JetLM/SDAR-8B-Chat-b32 with delayed cache enabled and FOCUS disabled.

Dataset notes:

  • dataset_id can be a HuggingFace dataset ID or a local JSON/JSONL path supported by benchmark/profile_throughput.py.
  • HuggingFace dataset IDs require the datasets package and network access.
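For local files, a ShareGPT-style layout like the one below is a plausible starting point, since the examples above use the ShareGPT dump. The `conversations`/`from`/`value` field names are an assumption mirroring that dataset, not a documented contract; verify the exact schema against benchmark/profile_throughput.py.

```python
import json

# Hypothetical minimal local dataset in ShareGPT-style JSON; field
# names are assumed, so check benchmark/profile_throughput.py before
# relying on this shape.
sample = [
    {
        "conversations": [
            {"from": "human", "value": "Explain block-diffusion decoding."},
            {"from": "gpt", "value": "Block-diffusion models denoise one block of tokens per step."},
        ]
    }
]

with open("local_dataset.json", "w") as f:
    json.dump(sample, f, indent=2)
```

The resulting ./local_dataset.json path could then be passed as the dataset_id argument to the scripts above.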

Generation Quality Testing

For generation quality evaluation of SDAR/LLaDA2.0-mini models, see opencompass-0.5.1.post1/README.md for OpenCompass benchmarking instructions with either HuggingFace/Transformers or LMDeploy backends.

Citation

If you find FOCUS useful in your work, please cite:

@article{liang2026focus,
  title   = {FOCUS: DLLMs Know How to Tame Their Compute Bound},
  author  = {Kaihua Liang and Xin Tan and An Zhong and Hong Xu and Marco Canini},
  journal = {arXiv preprint arXiv:2601.23278},
  year    = {2026},
  url     = {https://arxiv.org/abs/2601.23278}
}

Acknowledgements

FOCUS builds on and/or includes code and models from:

About

[ICML'26] Official implementation of "FOCUS: DLLMs Know How to Tame Their Compute Bound".
