Tsinghua-MARS-Lab/FluidTest

FluidTest

Official website: https://fluidtest.web.app/

FluidTest is the testing benchmark and leaderboard for measuring long-tail end-to-end autonomous-driving safety.

Background

End-to-end autonomous driving benchmarks have become increasingly saturated: models can score well on routine displacement and imitation metrics while still behaving unsafely in rare, high-stakes scenes. FluidTest addresses this gap by grounding evaluation in human safety preference. Labelers compare planner trajectories against expert driving logs and judge whether a system introduces additional threats beyond what a human driver would accept. The goal is to make safety in edge cases measurable, comparable, and aligned with how people actually weigh risk.

FluidTest evaluates planner trajectories in challenging scenarios, focusing on whether a model introduces additional threats relative to the expert trajectory. The warmup stage currently runs on the WOD-E2E dataset, and the public preflight flow is tied to the FluidTest benchmark and leaderboard.

This repository is a slim release for the benchmark warmup workflow. It currently contains the UniPlan-lite warmup baseline, Val151 subset filtering, and preflight submission helpers used to interact with the FluidTest benchmark.

Contents

  • src/fluidtest_baseline/uniplan.py: self-contained UniPlan-lite warmup baseline for benchmark submissions.
  • src/fluidtest_baseline/submission.py: prediction JSON/JSONL validation and Val151 selected-subset filtering for benchmark preflight.
  • tests/: focused tests for the warmup utility surface shipped in this repo.
  • docs/preflight_val151_upload_test.md: current benchmark preflight upload and deploy test result.
  • environment.yml: Conda environment for a fresh benchmark setup.
  • docs/conda_smoke_test.md: portable fresh-Conda smoke workflow and expected checks.
  • docs/milestone_tracker.md: task tracker and current validation status.

The benchmark website, leaderboard frontend, GRPO training code, and broader research workspace are not included here.

Setup

git clone https://github.com/Tsinghua-MARS-Lab/FluidTest.git
cd FluidTest
conda env create -f environment.yml
conda activate fluidtest-benchmark
python -m pytest

If Conda is not initialized in your shell yet, run your Conda installation's shell hook first. Common examples are:

source "$HOME/miniconda3/etc/profile.d/conda.sh"
source "$HOME/miniforge3/etc/profile.d/conda.sh"

You can also use a virtualenv instead:

python -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install --index-url https://download.pytorch.org/whl/cpu --no-deps torch
python -m pip install filelock typing-extensions "setuptools<82" sympy networkx jinja2 fsspec
python -m pip install -e ".[test]"

torch is intentionally installed through the environment setup instead of being hidden inside the package metadata, because the correct build depends on your platform and whether you want CPU-only or CUDA. The supplied Conda environment uses the default PyTorch package. For a CUDA-enabled environment, install the matching PyTorch build inside the Conda env first, then run:

python -m pip install -e ".[test]"

Warmup Stage

The current warmup stage uses the WOD-E2E dataset so teams can train and test against the FluidTest benchmark workflow. The official challenge dataset will keep the same data format and API, but it will be a different dataset. Teams must retrain their models on the official dataset when it is released.

Expected Data Format

Training and validation files are JSONL. Each line is one scenario with:

  • sample_id
  • ego fields such as current_vel, current_accel, past_traj, and optional intent
  • a 5-point or longer future trajectory under future_traj, future_trajectory, gt_trajectory, or target_trajectory
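The field list above can be read with a few lines of Python. The following is a minimal sketch, not the loader shipped in src/fluidtest_baseline: the helper names are hypothetical, the order in which the four trajectory keys are tried is an assumption, and the demo line is made up for illustration.

```python
import json
from typing import Any

# Key aliases named in the README. The lookup order is an assumption.
TRAJECTORY_KEYS = ("future_traj", "future_trajectory", "gt_trajectory", "target_trajectory")


def parse_scenario_line(line: str) -> dict[str, Any]:
    """Parse one JSONL line and check that it carries a sample_id."""
    scenario = json.loads(line)
    if "sample_id" not in scenario:
        raise KeyError("scenario is missing sample_id")
    return scenario


def extract_future_trajectory(scenario: dict[str, Any]) -> list:
    """Return the trajectory stored under the first alias that is present."""
    for key in TRAJECTORY_KEYS:
        if key in scenario:
            trajectory = scenario[key]
            if len(trajectory) < 5:
                raise ValueError(f"{key} has {len(trajectory)} points, need at least 5")
            return trajectory
    raise KeyError("no future trajectory key found")


# Made-up demo line: six future points stored under gt_trajectory.
demo_line = '{"sample_id": "demo", "gt_trajectory": [[1, 0], [2, 0], [3, 0], [4, 0], [5, 0], [6, 0]]}'
demo_scenario = parse_scenario_line(demo_line)
demo_trajectory = extract_future_trajectory(demo_scenario)
print(demo_scenario["sample_id"], len(demo_trajectory))
```

Resolving the alias in one place keeps the rest of a data pipeline independent of which key a given file happens to use.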

Example output prediction row:

{"sample_id":"example-scenario","future_trajectory":[[1.0,0.0],[2.1,0.1],[3.2,0.2],[4.4,0.3],[5.8,0.4]],"confidence":0.91}

In a prediction row, future_trajectory must contain exactly five [x, y] points, even though the input scenarios may carry longer future trajectories.
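Before uploading, it can help to check each row against the stated rule locally. The sketch below is an illustration only and is not the validation in src/fluidtest_baseline/submission.py. The function name is hypothetical, and the requirements that sample_id is a non-empty string and that coordinates are finite numbers are assumptions inferred from the example row. The confidence field is not checked because the README does not say whether it is required.

```python
import json
import math


def validate_prediction_row(row: dict) -> list:
    """Return a list of problems; an empty list means the row passed these checks."""
    problems = []
    sample_id = row.get("sample_id")
    # Assumption: sample_id is a non-empty string, as in the README example.
    if not isinstance(sample_id, str) or not sample_id:
        problems.append("sample_id must be a non-empty string")

    trajectory = row.get("future_trajectory")
    if not isinstance(trajectory, list) or len(trajectory) != 5:
        problems.append("future_trajectory must contain exactly five points")
        return problems

    for index, point in enumerate(trajectory):
        is_pair = isinstance(point, (list, tuple)) and len(point) == 2
        # Assumption: coordinates must be finite numbers (bool is rejected explicitly).
        if not is_pair or not all(
            isinstance(v, (int, float)) and not isinstance(v, bool) and math.isfinite(v)
            for v in point
        ):
            problems.append(f"point {index} must be a finite [x, y] pair")
    return problems


example = (
    '{"sample_id":"example-scenario","future_trajectory":'
    '[[1.0,0.0],[2.1,0.1],[3.2,0.2],[4.4,0.3],[5.8,0.4]],"confidence":0.91}'
)
example_problems = validate_prediction_row(json.loads(example))
print(example_problems)
```

Returning a list of problems rather than raising on the first one makes it easier to report every issue in a large prediction file in a single pass.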

Train UniPlan-Lite

fluidtest-baseline train-uniplan \
  --train-jsonl data/wod-e2e/train/train_samples.jsonl \
  --val-jsonl data/wod-e2e/val/val_samples.jsonl \
  --output-dir runs/uniplan-warmup \
  --device cuda \
  --epochs 3 \
  --batch-size 64

CPU smoke run:

fluidtest-baseline train-uniplan \
  --train-jsonl data/wod-e2e/train/train_samples.jsonl \
  --val-jsonl data/wod-e2e/val/val_samples.jsonl \
  --output-dir runs/uniplan-smoke \
  --device cpu \
  --epochs 1 \
  --batch-size 8 \
  --max-train-samples 128 \
  --max-val-samples 32 \
  --num-workers 0

The training run writes best.pt, last.pt, and history.json under --output-dir.

Infer On WOD-E2E Val

fluidtest-baseline infer-uniplan \
  --checkpoint runs/uniplan-warmup/best.pt \
  --samples-jsonl data/wod-e2e/val/val_samples.jsonl \
  --output-jsonl runs/submissions/uniplan_val_predictions.jsonl \
  --summary-json runs/submissions/uniplan_val_summary.json \
  --device cuda \
  --batch-size 128

The preflight server accepts a full validation prediction JSONL. If your file contains more scenes than the preflight subset, only the Val151 subset is used.

To produce a local filtered subset with your own selected-ID file:

fluidtest-baseline filter-val-subset \
  --predictions-jsonl runs/submissions/uniplan_val_predictions.jsonl \
  --sample-ids-json configs/val151_sample_ids.json \
  --output-jsonl runs/submissions/uniplan_val151_predictions.jsonl \
  --summary-json runs/submissions/uniplan_val151_summary.json
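Conceptually, the filtering step keeps only the prediction rows whose sample_id is in the selected set. The sketch below illustrates that idea and is not the filter-val-subset implementation. The function name and the summary fields are hypothetical, duplicate rows are resolved by keeping the first occurrence (an assumption), and the IDs are passed as a Python list because the layout of configs/val151_sample_ids.json is not described here.

```python
import json
import tempfile
from pathlib import Path


def filter_predictions(predictions_path, selected_ids, output_path) -> dict:
    """Copy rows whose sample_id is selected and report which IDs were not found."""
    selected = set(selected_ids)
    kept = set()
    with open(predictions_path, encoding="utf-8") as src, \
            open(output_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # tolerate blank lines
            row = json.loads(line)
            sample_id = row["sample_id"]
            # Assumption: if an ID appears twice, only the first row is kept.
            if sample_id in selected and sample_id not in kept:
                dst.write(json.dumps(row) + "\n")
                kept.add(sample_id)
    return {"selected": len(selected), "kept": len(kept), "missing": sorted(selected - kept)}


# Demo with made-up IDs written to a temporary directory.
workdir = Path(tempfile.mkdtemp())
full_path = workdir / "full_val_predictions.jsonl"
subset_path = workdir / "subset_predictions.jsonl"
trajectory = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0], [5.0, 0.0]]
full_path.write_text(
    "".join(
        json.dumps({"sample_id": sid, "future_trajectory": trajectory}) + "\n"
        for sid in ["a", "b", "c"]
    ),
    encoding="utf-8",
)
summary = filter_predictions(full_path, ["a", "c", "z"], subset_path)
print(summary)
```

A non-empty "missing" list is a useful early warning that a prediction file does not cover every selected scenario.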

Submit To Preflight

Open https://fluidtest-sigma.web.app/preflight_label.

  1. Enter team name and model name.
  2. Upload or paste the inference JSONL.
  3. Click Validate.
  4. Click Submit to preflight.
  5. Inspect the read-only result preview.

The result page shows only the 151 Val151 scenarios and the projected trajectories. It intentionally has no label buttons, matching the information surface shown to official labelers.

A second, stronger baseline, Poutine, will be released soon.

Tests

python -m pytest

Fresh Conda Smoke

The repeatable smoke script creates a fresh Conda environment from environment.yml, installs this package, runs pytest, runs a small train smoke, runs inference from that smoke checkpoint, and filters an existing full-val prediction file down to Val151.

Inputs:

  • TRAIN_JSONL: full training JSONL from the benchmark data bundle.
  • VAL_JSONL: full validation JSONL from the benchmark data bundle.
  • VAL151_JSONL: validation JSONL filtered to the selected Val151 scenarios.
  • REFERENCE_PREDICTIONS: any valid prediction JSONL covering the Val151 IDs. It is used only to smoke-test filtering, so a generated GT-copy reference is acceptable for this check and must not be reported as a model submission.
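A GT-copy reference of the kind mentioned for REFERENCE_PREDICTIONS can be pictured as copying each scenario's ground-truth trajectory into a prediction row. The sketch below is only an illustration and may differ from what scripts/prepare_smoke_inputs.py does. The function name is hypothetical, and taking the first five ground-truth points as the five prediction points is an assumption about how the two formats line up.

```python
import json

# Key aliases named in the README. The lookup order is an assumption.
TRAJECTORY_KEYS = ("future_traj", "future_trajectory", "gt_trajectory", "target_trajectory")


def gt_copy_row(scenario: dict) -> dict:
    """Build a prediction row that copies the scenario's ground-truth trajectory."""
    for key in TRAJECTORY_KEYS:
        if key in scenario:
            # Assumption: the first five points match the five prediction points.
            points = scenario[key][:5]
            break
    else:
        raise KeyError("no future trajectory key found")
    if len(points) != 5:
        raise ValueError(f"need at least 5 ground-truth points, got {len(points)}")
    return {
        "sample_id": scenario["sample_id"],
        # Keep only x and y in case a point carries extra components.
        "future_trajectory": [[float(p[0]), float(p[1])] for p in points],
    }


# Made-up scenario with six ground-truth points.
demo_scenario = {
    "sample_id": "demo",
    "future_traj": [[1, 0], [2, 0], [3, 0], [4, 0], [5, 0], [6, 0]],
}
reference_row = gt_copy_row(demo_scenario)
print(json.dumps(reference_row))
```

Such a file exercises the filtering path only. As stated above, it must not be reported as a model submission.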

If you only have the full validation JSONL, generate the derived Val151 and reference files first:

python scripts/prepare_smoke_inputs.py \
  --val-jsonl /path/to/val_samples.jsonl \
  --sample-ids-json configs/val151_sample_ids.json \
  --train-jsonl /path/to/train_samples.jsonl \
  --output-dir data/generated
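source data/generated/smoke_inputs.env
# Alternatively, if you already have all four files, set the inputs explicitly instead of sourcing the generated file:
export TRAIN_JSONL=/path/to/train_samples.jsonl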
export VAL_JSONL=/path/to/val_samples.jsonl
export VAL151_JSONL=/path/to/val151_samples.jsonl
export REFERENCE_PREDICTIONS=/path/to/full_val_predictions.jsonl
bash scripts/conda_smoke.sh

VAL151_IDS_JSON defaults to configs/val151_sample_ids.json, but you can override it if your benchmark data bundle provides a different selected-ID file. The script auto-detects conda; set CONDA_EXE=/path/to/conda if it is not on PATH. Set ENV_NAME or ENV_FILE to override the generated environment name or Conda environment file.

Logs are written under logs/.

About

FluidTest - The long-tail AV safety benchmark
