- [2026/05] 🔥 Realtime-VLA FLASH code is now available.
Realtime-VLA FLASH is the first speculative inference framework for diffusion-based VLAs.
- Speculative inference as fast as 7.8 ms (2 views), enabling over 125 Hz real-time inference.
- VLM-aligned draft architecture with a deployment-friendly block design.
- FLASH serving with customized Triton kernels, achieving a 3.04× average task-level speedup.
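The 125 Hz figure follows from the 7.8 ms latency, since the achievable rate is the reciprocal of the per-inference latency. A quick check, assuming one inference per control step and no other per-step overhead (the helper name `latency_ms_to_hz` is illustrative and not part of the repository):

```python
def latency_ms_to_hz(latency_ms: float) -> float:
    """Convert a per-inference latency in milliseconds to a rate in Hz."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


# 7.8 ms per inference, as reported above for the 2-view setting.
rate_hz = latency_ms_to_hz(7.8)
print(f"{rate_hz:.1f} Hz")  # 128.2 Hz
```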
Follow the openpi README:
git clone --recurse-submodules https://github.com/dexmal/realtime-vla-flash
# Or if you already cloned the repo:
git submodule update --init --recursive

Install the Python environment with uv:
GIT_LFS_SKIP_SMUDGE=1 uv sync
GIT_LFS_SKIP_SMUDGE=1 uv pip install -e .

The LIBERO client/evaluation code can run in a separate environment (see the LIBERO README).
First, convert the pretrained pi0 and draft checkpoints into the Triton weight layout.
uv run scripts/spec/triton/convert_for_triton.py \
--mode base \
--jax-path /path/to/jax/checkpoint \
--output converted/base
uv run scripts/spec/triton/convert_for_triton.py \
--mode draft \
--draft-ckpt /path/to/draft_model.pt \
--output converted/draft

Then start the policy server and the LIBERO client.
uv run scripts/spec/spec_serve_policy.py \
--config pi0_libero \
--base-triton-path converted/base \
--draft-triton-path converted/draft \
--task-suite-name libero_goal \
--backend triton
uv run scripts/spec/spec_client_libero.py \
--task-suite-name libero_goal

You can check the inference time on your local machine by running:
uv run python scripts/spec/pi0_benchmark.py
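
To time some other callable in a similar spirit, a minimal sketch follows. It is not the repository's benchmark script: `time_inference` and the stand-in workload are hypothetical names used only for illustration. On a GPU you would also need to synchronize the device before reading the clock, which this sketch omits.

```python
import statistics
import time
from typing import Callable


def time_inference(infer: Callable[[], object], warmup: int = 3, iters: int = 20) -> dict:
    """Return mean and median latency (ms) of `infer` after a few warm-up calls."""
    for _ in range(warmup):
        infer()  # warm-up calls are not timed
    samples_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        infer()
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "mean_ms": statistics.fmean(samples_ms),
        "median_ms": statistics.median(samples_ms),
        "iters": iters,
    }


# Stand-in workload; replace with a real policy call.
stats = time_inference(lambda: sum(range(10_000)))
print(f"mean {stats['mean_ms']:.3f} ms, median {stats['median_ms']:.3f} ms")
```
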
uv run scripts/spec/enc_cache.py \
--config pi0_libero \
--checkpoint-dir /openpi-assets/checkpoints/pi0_libero_torch \
--task-suite-name libero_goal \
--output-dir /tmp/spec_quickstart_train/libero_goal_cache
uv run scripts/spec/spec_draft_train.py \
--cache-dir /tmp/spec_quickstart_train/libero_goal_cache \
--output draft_model_goal_torch.pt

A typical workflow is:
- Build a prefix-embedding cache with scripts/spec/enc_cache.py.
- Train the draft head with scripts/spec/spec_draft_train.py.
- Serve the FLASH policy with scripts/spec/spec_serve_policy.py.
- Run LIBERO client evaluation or sweeps with scripts/spec/spec_client_libero.py or scripts/spec/exp/run_sweep.py.
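
The workflow steps can be collected in a small driver. The sketch below only assembles and prints the commands already shown in this README (a dry run; nothing is executed). The function name `build_workflow` is an illustrative assumption, and the paths are the placeholder values used above. It leaves out the Triton weight conversion, which has to happen before serving, and it assumes the server is started in a separate terminal before the client.

```python
import shlex


def build_workflow(cache_dir: str, draft_out: str, suite: str = "libero_goal") -> list[tuple[str, list[str]]]:
    """Assemble the README commands for each workflow step, in order."""
    return [
        ("build cache", [
            "uv", "run", "scripts/spec/enc_cache.py",
            "--config", "pi0_libero",
            "--checkpoint-dir", "/openpi-assets/checkpoints/pi0_libero_torch",
            "--task-suite-name", suite,
            "--output-dir", cache_dir,
        ]),
        ("train draft", [
            "uv", "run", "scripts/spec/spec_draft_train.py",
            "--cache-dir", cache_dir,
            "--output", draft_out,
        ]),
        ("serve policy", [
            "uv", "run", "scripts/spec/spec_serve_policy.py",
            "--config", "pi0_libero",
            "--base-triton-path", "converted/base",
            "--draft-triton-path", "converted/draft",
            "--task-suite-name", suite,
            "--backend", "triton",
        ]),
        ("run client", [
            "uv", "run", "scripts/spec/spec_client_libero.py",
            "--task-suite-name", suite,
        ]),
    ]


steps = build_workflow(
    "/tmp/spec_quickstart_train/libero_goal_cache",
    "draft_model_goal_torch.pt",
)
for name, argv in steps:
    print(f"# {name}")
    print(shlex.join(argv))  # dry run: print the command instead of executing it
```

Printing rather than executing keeps the sketch safe to run anywhere; each printed line can be pasted into a shell once the checkpoints and converted weights are in place.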
If you find this work useful, please cite the paper once the arXiv version is available:
@article{niu2026realtimevlaflash,
title={Realtime-VLA FLASH: Speculative Inference Framework for Diffusion-based VLAs},
author={Niu, Jiahui and Gu, Kefan and Zhao, Yucheng and Liang, Shengwen and Wang, Tiancai and Hu, Xing and Wang, Ying and Li, Huawei},
journal={arXiv preprint arXiv:2605.13778},
year={2026}
}