Qiqi Liu1,2,3*, Huan Xu3*, Jingyu Li1,2,3, Bin Sun3†, Zhihui Hao3†, Dangen She3, Xiatian Zhu4, Li Zhang1,2‡
1Fudan University; 2Shanghai Innovation Institute; 3Li Auto Inc.; 4University of Surrey
* equal contribution; † project leader; ‡ corresponding author
- [2026-04] Paper released on arXiv.
Uni-World VLA is a unified Vision-Language-Action model for autonomous driving that performs interleaved world modeling and planning. It jointly predicts future visual observations and ego trajectories in a single autoregressive sequence, tightly coupling world understanding with planning under temporal causality.
- Interleaved world modeling and planning: alternates future frame prediction and ego action/trajectory generation step-by-step, forming a closed-loop interaction that keeps planning conditioned on imagined observations.
- Unified autoregressive VLA formulation: generates visual tokens and action queries in a single sequence, tightly coupling prediction and control under temporal causality.
- Depth integration for geometric cues: augments historical frames with monocular depth maps and fuses geometry features via cross-attention to improve long-horizon scene prediction.
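The interleaving described above can be pictured with a small toy sketch. It is not the released model: `interleaved_rollout`, the two stub predictors, and the frame-then-action ordering within each step are all illustrative assumptions. It only shows how each action is generated after, and conditioned on, the most recently imagined frame within one causal sequence.

```python
from typing import Callable, List, Tuple, Union

Frame = List[int]                 # stand-in for one frame's visual tokens
Action = Tuple[float, float]      # stand-in for one ego waypoint (x, y)
Item = Tuple[str, Union[Frame, Action]]


def interleaved_rollout(
    history: List[Frame],
    predict_frame: Callable[[List[Item]], Frame],
    predict_action: Callable[[List[Item]], Action],
    steps: int,
) -> List[Item]:
    """Alternate frame prediction and action generation in a single sequence."""
    sequence: List[Item] = [("frame", f) for f in history]
    for _ in range(steps):
        # Imagine the next observation from everything generated so far.
        sequence.append(("frame", predict_frame(sequence)))
        # Plan the next action, conditioned on the frame just imagined.
        sequence.append(("action", predict_action(sequence)))
    return sequence


# Hypothetical stub predictors, used only to make the sketch runnable.
def toy_predict_frame(seq: List[Item]) -> Frame:
    last = [item for kind, item in seq if kind == "frame"][-1]
    return [t + 1 for t in last]


def toy_predict_action(seq: List[Item]) -> Action:
    last = [item for kind, item in seq if kind == "frame"][-1]
    n_actions = sum(1 for kind, _ in seq if kind == "action")
    return (float(n_actions + 1), sum(last) / len(last))


rollout = interleaved_rollout([[0, 0], [1, 1]], toy_predict_frame, toy_predict_action, steps=3)
kinds = [kind for kind, _ in rollout]
print(kinds)
```

Because both predictors receive the full sequence so far, neither can look at tokens that have not yet been generated, which is the temporal-causality constraint in miniature.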
Table 1. Closed-loop planning results on NAVSIM.
Table 2. World modeling / prediction results on NAVSIM.
Visualization.
git clone --recurse-submodules https://github.com/LogosRoboticsGroup/UniWorldVLA.git
cd UniWorldVLA

Note: The DA3 (Depth Anything 3) component is included as a git submodule. If you cloned without --recurse-submodules, run:

git submodule update --init --recursive
conda create -n uniworld python=3.10 -y
conda activate uniworld
pip install torch==2.2.1 torchvision==0.17.1 --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt

A fully pinned environment (including all transitive dependencies) is available in environment.yaml for exact reproducibility:

conda env create -f environment.yaml
Model weights are released at SII-Rigby/UniWorldVLA on HuggingFace.
See SETUP.md for detailed instructions on:
- Downloading backbone model weights (Show-O, Phi-1.5, MagViT-v2, I3D)
- Downloading our released checkpoints (VQ tokenizer, pre-trained model, SFT checkpoint)
- Preparing the NAVSIM dataset
- Setting up the DA3 depth model
All paths are configured in configs/sft_navsim/navsim.yaml. Before running, set the base_root field to the absolute path of this repository on your machine:
experiment:
  base_root: '/path/to/UniWorldVLA'  # <-- change this

All other paths in the config are derived from base_root via OmegaConf interpolation. See SETUP.md for the expected directory layout.
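Since every other path is derived from base_root, a wrong value tends to surface later as a confusing missing-file error. The following is a minimal pre-flight sketch, not part of the repository: `check_base_root` is a hypothetical helper, and the only layout fact it relies on is the config location named above (configs/sft_navsim/navsim.yaml).

```python
from pathlib import Path
from typing import List

# Config location stated in this README, relative to the repository root.
CONFIG_RELPATH = Path("configs/sft_navsim/navsim.yaml")


def check_base_root(base_root: str) -> List[str]:
    """Return a list of problems found with a candidate base_root value."""
    problems: List[str] = []
    root = Path(base_root)
    if not root.is_absolute():
        problems.append(f"base_root is not an absolute path: {base_root}")
    if not root.is_dir():
        problems.append(f"base_root is not an existing directory: {base_root}")
    elif not (root / CONFIG_RELPATH).is_file():
        problems.append(f"missing {CONFIG_RELPATH} under {base_root}")
    return problems


# Replace the placeholder with the value you put in navsim.yaml.
for problem in check_base_root("/path/to/UniWorldVLA"):
    print(problem)
```

An empty list means the value is absolute and points at a directory containing the config file; it does not verify the rest of the layout described in SETUP.md.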
cd UniWorldVLA
bash scripts/finetune/navsim/run_sft_navsim_baseline8.sh

The script launches training/fine-tune_navsim.py via accelerate with the DeepSpeed ZeRO-2 config in accelerate_configs/8_gpus_deepspeed_zero2.yaml.
Key training options are controlled via environment variables (see training/fine-tune_navsim.py → configure_experiment_from_env):
| Variable | Default | Description |
|---|---|---|
| MAX_TRAIN_STEPS | 160000 | Total training steps |
| LEARNING_RATE | 1e-5 | Base learning rate |
| BATCH_SIZE_TRAIN_NUS | 7 | Per-GPU batch size |
| VIDEO_COEFF | 0.3 | Weight for video prediction loss |
| TJ_COEFF | 1.0 | Weight for trajectory loss |
| EVAL_ONLY | 0 | Set to 1 to run evaluation only |
| LOCAL_RUN_PWM | 0 | Set to 1 for local debug mode (small subset) |
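To make the override behavior concrete, here is a minimal sketch of reading these variables with the defaults from the table. It is not the repository's `configure_experiment_from_env`: `TrainOptions`, `options_from_env`, and `weighted_loss` are hypothetical names, and combining the two coefficients as a plain weighted sum is an assumption about how the loss weights are applied.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class TrainOptions:
    # Defaults copied from the table above.
    max_train_steps: int = 160000
    learning_rate: float = 1e-5
    batch_size_train_nus: int = 7
    video_coeff: float = 0.3
    tj_coeff: float = 1.0
    eval_only: bool = False
    local_run_pwm: bool = False


def options_from_env(env: Optional[Mapping[str, str]] = None) -> TrainOptions:
    """Build options from environment variables, falling back to the defaults."""
    env = os.environ if env is None else env
    d = TrainOptions()
    return TrainOptions(
        max_train_steps=int(env.get("MAX_TRAIN_STEPS", d.max_train_steps)),
        learning_rate=float(env.get("LEARNING_RATE", d.learning_rate)),
        batch_size_train_nus=int(env.get("BATCH_SIZE_TRAIN_NUS", d.batch_size_train_nus)),
        video_coeff=float(env.get("VIDEO_COEFF", d.video_coeff)),
        tj_coeff=float(env.get("TJ_COEFF", d.tj_coeff)),
        eval_only=env.get("EVAL_ONLY", "0") == "1",
        local_run_pwm=env.get("LOCAL_RUN_PWM", "0") == "1",
    )


def weighted_loss(video_loss: float, traj_loss: float, opts: TrainOptions) -> float:
    # Assumption: the total loss is a weighted sum of the two terms.
    return opts.video_coeff * video_loss + opts.tj_coeff * traj_loss


# Simulate overriding two variables; the rest keep their defaults.
opts = options_from_env({"MAX_TRAIN_STEPS": "50000", "LEARNING_RATE": "3e-5"})
print(opts)
```

Passing an explicit mapping instead of reading `os.environ` directly keeps the sketch easy to test without touching the real environment.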
Example with custom settings:
MAX_TRAIN_STEPS=50000 LEARNING_RATE=3e-5 \
bash scripts/finetune/navsim/run_sft_navsim_baseline8.sh

Local debug run on a small subset:

LOCAL_RUN_PWM=1 bash scripts/finetune/navsim/run_sft_navsim_baseline8_local.sh

Evaluation only, from a checkpoint:

EVAL_ONLY=1 EVAL_FROM_CHECKPOINT=1 \
EVAL_DIR=/path/to/checkpoint \
bash scripts/finetune/navsim/run_sft_navsim_baseline8.sh

Evaluation computes:
- PDMS (PDM Score) for closed-loop planning
- FVD for video prediction quality
- Release arXiv paper
- Release code
- Release model weights
@article{liu2026uniworld,
title = {Uni-World VLA: Interleaved World Modeling and Planning for Autonomous Driving},
author = {Liu, Qiqi and Xu, Huan and Li, Jingyu and Sun, Bin and Hao, Zhihui and She, Dangen and Zhu, Xiatian and Zhang, Li},
journal = {arXiv preprint arXiv:2603.27287},
year = {2026},
}


