Official website: https://fluidtest.web.app/
FluidTest is the testing benchmark and leaderboard for measuring long-tail end-to-end autonomous-driving safety.
End-to-end autonomous driving benchmarks have become increasingly saturated: models can score well on routine displacement and imitation metrics while still behaving unsafely in rare, high-stakes scenes. FluidTest addresses this gap by grounding evaluation in human safety preference. Labelers compare planner trajectories against expert driving logs and judge whether a system introduces additional threats beyond what a human driver would accept. The goal is to make safety in edge cases measurable, comparable, and aligned with how people actually weigh risk.
FluidTest evaluates planner trajectories in challenging scenarios and focuses on whether a model introduces additional threats relative to the expert trajectory. The warmup stage currently runs on the WOD-E2E dataset, and the public preflight flow is tied to the FluidTest benchmark and leaderboard.
This repository is a slim release for the benchmark warmup workflow. It currently contains the UniPlan-lite warmup baseline, Val151 subset filtering, and preflight submission helpers used to interact with the FluidTest benchmark.
- src/fluidtest_baseline/uniplan.py: self-contained UniPlan-lite warmup baseline for benchmark submissions.
- src/fluidtest_baseline/submission.py: prediction JSON/JSONL validation and Val151 selected-subset filtering for benchmark preflight.
- tests/: focused tests for the warmup utility surface shipped in this repo.
- docs/preflight_val151_upload_test.md: current benchmark preflight upload and deploy test result.
- environment.yml: Conda environment for a fresh benchmark setup.
- docs/conda_smoke_test.md: portable fresh-Conda smoke workflow and expected checks.
- docs/milestone_tracker.md: task tracker and current validation status.
The benchmark website, leaderboard frontend, GRPO training code, and broader research workspace are not included here.
git clone https://github.com/Tsinghua-MARS-Lab/FluidTest.git
cd FluidTest
conda env create -f environment.yml
conda activate fluidtest-benchmark
python -m pytest
If Conda is not initialized in your shell yet, run your Conda installation's shell hook first. Common examples are:
source "$HOME/miniconda3/etc/profile.d/conda.sh"
source "$HOME/miniforge3/etc/profile.d/conda.sh"
You can also use a virtualenv instead:
python -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install --index-url https://download.pytorch.org/whl/cpu --no-deps torch
python -m pip install filelock typing-extensions "setuptools<82" sympy networkx jinja2 fsspec
python -m pip install -e ".[test]"
torch is intentionally installed through the environment setup instead of being hidden inside the package metadata, because the correct build depends on your platform and whether you want CPU-only or CUDA. The supplied Conda environment uses the default PyTorch package. For a CUDA-enabled environment, install the matching PyTorch build inside the Conda env first, then run:
python -m pip install -e ".[test]"
The current warmup stage uses the WOD-E2E dataset so teams can train and test against the FluidTest benchmark workflow. The official challenge dataset will keep the same data format and API, but it will be a different dataset. Teams must retrain their models on the official dataset when it is released.
Training and validation files are JSONL. Each line is one scenario with:
- sample_id
- ego fields such as current_vel, current_accel, past_traj, and optional intent
- a 5-point or longer future trajectory under future_traj, future_trajectory, gt_trajectory, or target_trajectory
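As a rough illustration of the scenario format listed above, the sketch below reads JSONL lines and pulls out the future trajectory under whichever of the four key names is present. The function names and the lookup order are assumptions made for this example; this is not the loader shipped in src/fluidtest_baseline.

```python
import json
from typing import Iterable, Iterator, Optional, Tuple

# Key names come from the field list above; the lookup order is an assumption.
FUTURE_KEYS = ("future_traj", "future_trajectory", "gt_trajectory", "target_trajectory")


def extract_future(scenario: dict) -> Optional[list]:
    """Return the first future trajectory found under any accepted key."""
    for key in FUTURE_KEYS:
        traj = scenario.get(key)
        if traj is not None:
            return traj
    return None


def parse_scenarios(lines: Iterable[str]) -> Iterator[Tuple[str, Optional[list]]]:
    """Yield (sample_id, future trajectory) for each non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        scenario = json.loads(line)
        yield scenario["sample_id"], extract_future(scenario)
```

To read a real file, pass an open file handle, for example `parse_scenarios(open("data/wod-e2e/val/val_samples.jsonl"))`.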
Example output prediction row:
{"sample_id":"example-scenario","future_trajectory":[[1.0,0.0],[2.1,0.1],[3.2,0.2],[4.4,0.3],[5.8,0.4]],"confidence":0.91}
future_trajectory must contain exactly five [x, y] points.
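A quick local check of the five-point rule can catch malformed rows before upload. The sketch below is a hypothetical validator written for this README, not the one in src/fluidtest_baseline/submission.py; treating sample_id as a non-empty string and requiring finite numbers are assumptions based on the example row.

```python
import math


def validate_prediction_row(row: dict) -> list:
    """Return a list of problems; an empty list means the row looks valid."""
    problems = []
    sample_id = row.get("sample_id")
    # Assumption: sample_id is a non-empty string, as in the example row.
    if not isinstance(sample_id, str) or not sample_id:
        problems.append("sample_id must be a non-empty string")
    traj = row.get("future_trajectory")
    if not isinstance(traj, list) or len(traj) != 5:
        problems.append("future_trajectory must contain exactly five points")
        return problems
    for i, point in enumerate(traj):
        ok = (
            isinstance(point, list)
            and len(point) == 2
            and all(
                isinstance(v, (int, float))
                and not isinstance(v, bool)
                and math.isfinite(v)
                for v in point
            )
        )
        if not ok:
            problems.append(f"point {i} must be a finite [x, y] pair")
    return problems
```

Run it over each parsed line of a prediction file and treat any non-empty result as a reason to fix the row before submitting.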
fluidtest-baseline train-uniplan \
--train-jsonl data/wod-e2e/train/train_samples.jsonl \
--val-jsonl data/wod-e2e/val/val_samples.jsonl \
--output-dir runs/uniplan-warmup \
--device cuda \
--epochs 3 \
--batch-size 64
CPU smoke run:
fluidtest-baseline train-uniplan \
--train-jsonl data/wod-e2e/train/train_samples.jsonl \
--val-jsonl data/wod-e2e/val/val_samples.jsonl \
--output-dir runs/uniplan-smoke \
--device cpu \
--epochs 1 \
--batch-size 8 \
--max-train-samples 128 \
--max-val-samples 32 \
--num-workers 0
The training run writes best.pt, last.pt, and history.json under --output-dir.
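Before moving on to inference, it can help to confirm that a run actually produced the three files named above. The helper below is a small sketch with a hypothetical name; it only checks that the files exist and makes no assumption about their contents.

```python
from pathlib import Path

# File names taken from the sentence above.
EXPECTED_OUTPUTS = ("best.pt", "last.pt", "history.json")


def missing_outputs(output_dir) -> list:
    """Return the expected training outputs that are absent from output_dir."""
    root = Path(output_dir)
    return [name for name in EXPECTED_OUTPUTS if not (root / name).is_file()]
```

For the CPU smoke run, `missing_outputs("runs/uniplan-smoke")` should return an empty list once training has finished.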
fluidtest-baseline infer-uniplan \
--checkpoint runs/uniplan-warmup/best.pt \
--samples-jsonl data/wod-e2e/val/val_samples.jsonl \
--output-jsonl runs/submissions/uniplan_val_predictions.jsonl \
--summary-json runs/submissions/uniplan_val_summary.json \
--device cuda \
--batch-size 128
The preflight server accepts a full validation prediction JSONL. If your file contains more scenes than the preflight subset, only the Val151 subset is used.
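Conceptually, the subset selection keeps only the rows whose sample_id appears in a selected-ID list. The sketch below illustrates that idea under stated assumptions: the selected IDs are passed in as a plain list of strings, duplicates are dropped after the first occurrence, and the function name is hypothetical. It is not the filter-val-subset implementation, and the on-disk layout of the selected-ID file is not assumed here.

```python
import json


def filter_predictions(prediction_lines, selected_ids):
    """Keep rows whose sample_id is selected; also report selected IDs never seen."""
    wanted = set(selected_ids)
    kept, seen = [], set()
    for line in prediction_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        row = json.loads(line)
        sample_id = row.get("sample_id")
        # Assumption: the first row for a given ID wins; later duplicates are ignored.
        if sample_id in wanted and sample_id not in seen:
            kept.append(row)
            seen.add(sample_id)
    missing = sorted(wanted - seen)
    return kept, missing
```

A non-empty `missing` list means the prediction file does not cover every selected scenario, which is worth resolving before uploading.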
To produce a local filtered subset with your own selected-ID file:
fluidtest-baseline filter-val-subset \
--predictions-jsonl runs/submissions/uniplan_val_predictions.jsonl \
--sample-ids-json configs/val151_sample_ids.json \
--output-jsonl runs/submissions/uniplan_val151_predictions.jsonl \
--summary-json runs/submissions/uniplan_val151_summary.json
Open https://fluidtest-sigma.web.app/preflight_label.
- Enter team name and model name.
- Upload or paste the inference JSONL.
- Click Validate.
- Click Submit to preflight.
- Inspect the read-only result preview.
The result page shows only the 151 Val151 scenarios and the projected trajectories. It intentionally has no label buttons, matching the information surface shown to official labelers.
A stronger baseline, Poutine, will be released soon.
python -m pytest
The repeatable smoke script creates a fresh Conda environment from environment.yml, installs this package, runs pytest, runs a small train smoke, runs inference from that smoke checkpoint, and filters an existing full-val prediction file down to Val151.
Inputs:
- TRAIN_JSONL: full training JSONL from the benchmark data bundle.
- VAL_JSONL: full validation JSONL from the benchmark data bundle.
- VAL151_JSONL: validation JSONL filtered to the selected Val151 scenarios.
- REFERENCE_PREDICTIONS: any valid prediction JSONL covering the Val151 IDs. It is used only to smoke-test filtering, so a generated GT-copy reference is acceptable for this check and must not be reported as a model submission.
If you only have the full validation JSONL, generate the derived Val151 and reference files first:
python scripts/prepare_smoke_inputs.py \
--val-jsonl /path/to/val_samples.jsonl \
--sample-ids-json configs/val151_sample_ids.json \
--train-jsonl /path/to/train_samples.jsonl \
--output-dir data/generated
source data/generated/smoke_inputs.env
export TRAIN_JSONL=/path/to/train_samples.jsonl
export VAL_JSONL=/path/to/val_samples.jsonl
export VAL151_JSONL=/path/to/val151_samples.jsonl
export REFERENCE_PREDICTIONS=/path/to/full_val_predictions.jsonl
bash scripts/conda_smoke.sh
VAL151_IDS_JSON defaults to configs/val151_sample_ids.json, but you can override it if your benchmark data bundle provides a different selected-ID file. The script auto-detects conda; set CONDA_EXE=/path/to/conda if it is not on PATH. Set ENV_NAME or ENV_FILE to override the generated environment name or Conda environment file.
Logs are written under logs/.
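Because the smoke script depends on several input variables, a short pre-check can save a failed run. The sketch below is a hypothetical helper, not part of scripts/conda_smoke.sh; it assumes each of the four input variables must point to an existing file and leaves the optional overrides (VAL151_IDS_JSON, CONDA_EXE, ENV_NAME, ENV_FILE) unchecked.

```python
import os

# The four input variables described in the Inputs list.
REQUIRED_INPUTS = ("TRAIN_JSONL", "VAL_JSONL", "VAL151_JSONL", "REFERENCE_PREDICTIONS")


def check_smoke_inputs(env=None) -> list:
    """Return a list of problems with the smoke-test input variables."""
    env = os.environ if env is None else env
    problems = []
    for name in REQUIRED_INPUTS:
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif not os.path.isfile(value):
            # Assumption: every input must be an existing regular file.
            problems.append(f"{name} does not point to a file: {value}")
    return problems
```

Calling `check_smoke_inputs()` with no arguments inspects the current shell environment; an empty list means all four inputs resolve to files.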