mini-pi0 is a compact research codebase for training flow-matching robot
action policies on ManiSkill demonstrations. The current stack centers on
mini_pi0_fm: an image-conditioned action-chunk policy with transformer,
CNN1D, and UNet1D denoisers.
StackCube: 95.5% success over 200 eval episodes with the medium transformer + ViT policy.
Click the preview to open the MP4.
StackPyramid: 26.0% success over 50 eval episodes with the medium transformer + ViT policy.
| Base camera success | Wrist camera success |
|---|---|
| ![]() | ![]() |
Click a preview to open the MP4.
PegInsertionSide is the current hard task. The latest contact + hole-camera policy reaches 10.0% success over 100 eval episodes, with most remaining failures reaching the hole but timing out before stable insertion.
| Base camera success | Left hole camera success | Right hole camera success |
|---|---|---|
| ![]() | ![]() | ![]() |
Click a preview to open the MP4.
- ManiSkill simulation workflows for collection, replay, conversion, training, eval, deploy-sim, and diagnostics.
- mini_pi0_fm flow-matching policy with:
  - action backbones: transformer, cnn1d, unet1d
  - vision backbones: resnet18, timm
  - multi-camera cross-attention conditioning
  - observation history and chunked action prediction
- Robomimic-style HDF5 loading for converted ManiSkill trajectories.
- Action diagnostics for comparing flow steps and per-dimension clipping.
- PegInsertion support with close hole cameras and optional contact features.
- IsaacLab adapter scaffold kept for future integration work.
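
To make the "flow steps" and "per-dimension clipping" diagnostics in the list above concrete, here is a minimal sketch in plain Python. It is not the repo's implementation: `euler_sample`, `clip_fraction_per_dim`, `toy_velocity`, and the `[-1, 1]` action range are all assumptions made for illustration. The sampler integrates a velocity field from t=0 to t=1 with a fixed number of Euler steps, which is one common way to draw actions from a flow-matching model.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def euler_sample(velocity: Callable[[Vector, float], Vector],
                 x0: Sequence[float], flow_steps: int) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    if flow_steps < 1:
        raise ValueError("flow_steps must be >= 1")
    x = list(x0)
    dt = 1.0 / flow_steps
    for k in range(flow_steps):
        v = velocity(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def clip_fraction_per_dim(actions: Sequence[Sequence[float]],
                          low: float = -1.0, high: float = 1.0) -> Vector:
    """Per action dimension, the fraction of samples outside [low, high]."""
    if not actions:
        return []
    dims = len(actions[0])
    counts = [0] * dims
    for action in actions:
        for d in range(dims):
            if action[d] < low or action[d] > high:
                counts[d] += 1
    return [c / len(actions) for c in counts]


# Hypothetical stand-in for a trained denoiser: pulls samples toward a fixed target.
TOY_TARGET = [0.5, 1.4]


def toy_velocity(x: Vector, t: float) -> Vector:
    return [ti - xi for ti, xi in zip(TOY_TARGET, x)]


if __name__ == "__main__":
    # Compare the same start point under different step counts.
    for steps in (4, 6, 8):
        sample = euler_sample(toy_velocity, [0.0, 0.0], steps)
        print(steps, sample, clip_fraction_per_dim([sample]))
```

With a real model, the velocity callable would wrap the denoiser and its conditioning; the point of the comparison is that the step count can change both the sampled action and how often each dimension would be clipped.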
mini_pi0/
  cli/      # train/eval/convert/collect command entrypoints
  config/   # typed dataclass config schema and YAML loading
  dataset/  # HDF5 episode loading, conversion, torch datasets
  sim/      # simulator adapters and custom ManiSkill environments
  models/   # mini_pi0_fm model and registry
  train/    # training loop, optimizer, checkpointing
  eval/     # rollout eval, action diagnostics, grid videos
  deploy/   # simulation deployment loop
  utils/    # runtime/device/parity helpers
examples/configs/
  maniskill3_stackcube_motionplanning_transformer_vit_hist2_medium.yaml
  maniskill3_tray_transformer_vit_hist2_chunk16.yaml
  maniskill3_peginsertion_motionplanning_transformer_vit_hist3_medium_holecam_contacts.yaml
This project is uv-first. Use Python 3.11 for the ManiSkill + LeRobot stack.
uv venv --python 3.11 .venv
. .venv/bin/activate
uv pip install -e ".[maniskill3,vision,dev]"

For LeRobot v3 support in the same environment as ManiSkill, install with the repo constraints so the simulator stack stays compatible:
uv pip install -e ".[maniskill3,lerobot,vision,dev]" \
-c constraints-maniskill-lerobot.txt

Pip fallback:
python -m venv .venv
. .venv/bin/activate
pip install -e ".[maniskill3,lerobot,vision,dev]" \
-c constraints-maniskill-lerobot.txt

On shared machines with read-only Hugging Face cache paths, set a writable cache:
export HF_HOME=/tmp/minipi_hf_cache

Train the StackCube transformer + ViT policy:
mini-pi0 train \
--config examples/configs/maniskill3_stackcube_motionplanning_transformer_vit_hist2_medium.yaml

Evaluate a checkpoint:
mini-pi0 eval \
--config examples/configs/maniskill3_stackcube_motionplanning_transformer_vit_hist2_medium.yaml \
--set eval.checkpoint=runs/<experiment>/run1/checkpoints/best.pt \
--set eval.action_stats_path=runs/<experiment>/run1/artifacts/action_stats.json \
--set eval.record_grid=true \
--set eval.grid_cameras='["base_camera","hand_camera"]'

Run offline action diagnostics:
python -m mini_pi0.eval.action_diagnostics \
--config examples/configs/maniskill3_stackcube_motionplanning_transformer_vit_hist2_medium.yaml \
--checkpoint runs/<experiment>/run1/checkpoints/best.pt \
--action_stats runs/<experiment>/run1/artifacts/action_stats.json \
--flow_steps 4,6,8

Download built-in ManiSkill demonstrations:
python -m mani_skill.utils.download_demo StackCube-v1 \
--output_dir demos/maniskill

Replay demonstrations with RGBD observations and the target controller:
python -m mani_skill.trajectory.replay_trajectory \
--traj-path demos/maniskill/StackCube-v1/motionplanning/trajectory.h5 \
--obs-mode rgbd \
--target-control-mode pd_joint_pos \
--save-traj

Convert the replayed trajectory into the training schema:
mini-pi0 convert-maniskill-trajectory \
--input_hdf5 demos/maniskill/StackCube-v1/motionplanning/trajectory.rgbd.pd_joint_pos.physx_cpu.h5 \
--output_hdf5 data/robomimic/maniskill/stackcube/mp/rgbd_pd_joint_pos.hdf5 \
--overwrite

Or convert a replayed ManiSkill trajectory directly into LeRobot v3 format with ManiSkill's official converter:
python -m mani_skill.trajectory.convert_to_lerobot \
--traj-path demos/maniskill/StackCube-v1/motionplanning/trajectory.rgbd.pd_joint_pos.physx_cpu.h5 \
--output-dir data/lerobot/stackcube-rgbd-pd-joint-pos \
--task-name "Stack cube" \
--fps 20 \
--image-size 128x128 \
--robot-type panda

Existing robomimic HDF5 files can also be ported to LeRobot v3:
mini-pi0 convert-robomimic-to-lerobot \
--input_hdf5 data/robomimic/maniskill/stackcube/mp/rgbd_pd_joint_pos.hdf5 \
--output_dir data/lerobot/stackcube-rgbd-pd-joint-pos \
--repo_id local/stackcube-rgbd-pd-joint-pos \
--task_name "Stack cube" \
--fps 20 \
--image_keys agentview_image,robot0_eye_in_hand_image \
--overwrite

Train from LeRobot v3 by switching the dataset format:
mini-pi0 train \
--config examples/configs/maniskill3_stackcube_motionplanning_transformer_vit_hist2_medium.yaml \
--set data.format=lerobot_v3 \
--set data.lerobot_repo_id=local/stackcube-rgbd-pd-joint-pos \
--set data.lerobot_root=data/lerobot/stackcube-rgbd-pd-joint-pos \
--set data.lerobot_image_keys='["observation.images.agentview_image","observation.images.robot0_eye_in_hand_image"]'

For full data notes, custom environment collection, and task conversion details, see docs/DATASETS.md and docs/SIMULATION.md.
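
The commands above pass overrides as `--set dotted.key=value`, with lists written as JSON. As a rough illustration of that pattern, the sketch below applies such overrides to a nested dictionary. It is an assumption about how this style of override can work, not the parser mini-pi0 actually uses; `parse_value` and `apply_overrides` are hypothetical names.

```python
import json
from typing import Any, Dict, Iterable


def parse_value(raw: str) -> Any:
    """Parse JSON literals (true, 16, ["a", "b"]); otherwise keep the raw string."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return raw


def apply_overrides(config: Dict[str, Any], overrides: Iterable[str]) -> Dict[str, Any]:
    """Apply dotted key=value overrides in place and return the config."""
    for item in overrides:
        key, sep, raw = item.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {item!r}")
        *parents, leaf = key.split(".")
        node = config
        for part in parents:
            # Create intermediate sections on demand.
            node = node.setdefault(part, {})
        node[leaf] = parse_value(raw)
    return config


if __name__ == "__main__":
    # "placeholder" stands in for whatever value the YAML file provides.
    cfg = {"data": {"format": "placeholder"}}
    apply_overrides(cfg, [
        "data.format=lerobot_v3",
        'eval.grid_cameras=["base_camera","hand_camera"]',
    ])
    print(cfg)
```

Falling back to the raw string keeps unquoted values such as paths and repo ids usable without extra shell quoting.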
PegInsertionSide is tracked separately because it needs better visual access to the hole and benefits from contact diagnostics. This branch includes:
- MiniPi0PegInsertionSide-v1, a repo-local ManiSkill environment with close hole_left_camera and hole_right_camera sensors.
- Replay helpers for local env registration.
- Contact extraction into HDF5 under obs/* keys.
- A contact-aware config: examples/configs/maniskill3_peginsertion_motionplanning_transformer_vit_hist3_medium_holecam_contacts.yaml.
Prepare the hole-camera contact dataset:
NUM_ENVS=16 tools/prepare_peginsertion_holecam_contacts.sh

Train the current PegInsertion config:
mini-pi0 train \
--config examples/configs/maniskill3_peginsertion_motionplanning_transformer_vit_hist3_medium_holecam_contacts.yaml

Read the task-specific notes in docs/PEG_INSERTION.md.
The main knobs are:
- model.action_backbone: transformer, cnn1d, or unet1d
- model.vision_backbone: resnet18 or timm
- model.conditioning_mode: cross_attention or global
- model.obs_horizon: number of observation frames
- data.chunk_size and model.chunk_size: predicted action horizon
- robot.image_keys: ordered camera observations
- robot.state_keys: proprio/contact state keys
- eval.grid_cameras: one or more cameras for saved rollout grids
When camera keys, state keys, action dimension, or chunk size change, retrain or use a checkpoint trained with the same interface.
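
One way to catch an interface mismatch before a rollout is to compare the fields a checkpoint was trained with against the current config. The sketch below is illustrative only: the flat dictionary layout, `INTERFACE_FIELDS`, and `interface_mismatches` are assumptions, and the README does not say whether mini-pi0 checkpoints store this metadata.

```python
from typing import Any, Dict, List

# Fields that define the policy's input/output interface (assumed names).
INTERFACE_FIELDS = ("image_keys", "state_keys", "action_dim", "chunk_size")


def interface_mismatches(trained: Dict[str, Any], current: Dict[str, Any]) -> List[str]:
    """Return one message per interface field that differs between the two configs."""
    problems = []
    for field in INTERFACE_FIELDS:
        if trained.get(field) != current.get(field):
            problems.append(
                f"{field}: checkpoint={trained.get(field)!r} config={current.get(field)!r}"
            )
    return problems


if __name__ == "__main__":
    trained = {"image_keys": ["base_camera", "hand_camera"], "state_keys": ["qpos"],
               "action_dim": 8, "chunk_size": 16}
    # Same cameras in a different order still count as a mismatch,
    # because camera observations are ordered.
    current = dict(trained, image_keys=["hand_camera", "base_camera"])
    for line in interface_mismatches(trained, current):
        print(line)
```

Lists are compared element by element, so reordering cameras is reported even when the set of keys is unchanged.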
Run focused checks:
python -m pytest \
tests/test_config.py \
tests/test_model_registry.py \
tests/test_fm_architecture.py \
tests/test_training_stability_controls.py \
tests/test_eval_weight_source.py \
-q

Run the full suite:
python -m pytest -q

Full task benchmark tracking lives in docs/TASK_BENCHMARK.md.
Config:
examples/configs/maniskill3_stackcube_motionplanning_transformer_vit_hist2_medium.yaml
| Metric | Value |
|---|---|
| Success rate | 95.5% |
| CI95 | 92.5% - 98.0% |
| Episodes | 200 |
| Mean episode length | 166.2 steps |
| Mean inference speed | 31.9 ms/chunk |
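
The table does not state how the CI95 row was computed. For readers who want to sanity-check a success rate from their own runs, a Wilson score interval is one standard option; the sketch below is an assumption, not the repo's method, and it will not necessarily reproduce the bounds reported above (a bootstrap, for example, gives slightly different numbers). `wilson_interval` is a hypothetical helper name.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, episodes: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial success rate (z=1.96 for about 95%)."""
    if episodes <= 0:
        raise ValueError("episodes must be positive")
    if not 0 <= successes <= episodes:
        raise ValueError("successes must be between 0 and episodes")
    p = successes / episodes
    denom = 1.0 + z * z / episodes
    center = (p + z * z / (2.0 * episodes)) / denom
    half = z * math.sqrt(p * (1.0 - p) / episodes + z * z / (4.0 * episodes ** 2)) / denom
    return center - half, center + half


if __name__ == "__main__":
    # 95.5% of 200 episodes corresponds to 191 successes.
    low, high = wilson_interval(191, 200)
    print(f"{low:.3f} - {high:.3f}")
```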
Artifacts:
Config:
examples/configs/maniskill3_stackpyramid_motionplanning_transformer_vit_hist2_medium.yaml
| Metric | Value |
|---|---|
| Success rate | 26.0% |
| CI95 | 14.0% - 38.0% |
| Episodes | 50 |
| Mean episode length | 444.9 steps |
| Mean inference speed | 32.6 ms/chunk |
| Failure modes | 37 no progress, 13 success |
Artifacts:
Variant:
pd_ee_delta_pose small policy with hist2 base + wrist cameras.
| Metric | Value |
|---|---|
| Success rate | 31.0% |
| CI95 | 23.0% - 41.0% |
| Episodes | 100 |
| Mean episode length | 408.4 steps |
| Mean inference speed | 27.7 ms/chunk |
| Failure modes | 69 timeout after progress, 31 success |
| Base camera success | Wrist camera success |
|---|---|
| ![]() | ![]() |
Artifacts:
Config:
examples/configs/maniskill3_peginsertion_motionplanning_transformer_vit_hist3_medium_holecam_contacts.yaml
| Metric | Value |
|---|---|
| Success rate | 10.0% |
| CI95 | 5.0% - 16.0% |
| Episodes | 100 |
| Mean episode length | 474.8 steps |
| Mean inference speed | 43.4 ms/chunk |
| Failure modes | 90 timeout after progress, 10 success |
Artifacts:
Current interpretation: the policy now solves a small but real fraction of rollouts. Most failures still make progress toward the hole, so the next useful step is phase diagnostics: grasp state, peg-hole distance, angular alignment, insertion depth, contact force, and jamming signals.
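
Several of the proposed phase diagnostics are simple geometry once peg and hole poses are available. The sketch below shows one possible formulation of peg-hole distance, lateral error, insertion depth, and angular alignment. Every name here (`insertion_diagnostics` and its arguments) is hypothetical, and it assumes the environment can provide the peg tip position, the hole entry position, and both axes in a common frame, with the hole axis pointing into the hole.

```python
import math
from typing import Dict, Sequence


def _dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def _norm(a: Sequence[float]) -> float:
    return math.sqrt(_dot(a, a))


def insertion_diagnostics(peg_tip: Sequence[float], peg_axis: Sequence[float],
                          hole_entry: Sequence[float],
                          hole_axis: Sequence[float]) -> Dict[str, float]:
    """Geometric insertion metrics; hole_axis is assumed to point into the hole."""
    h_len, p_len = _norm(hole_axis), _norm(peg_axis)
    if h_len == 0.0 or p_len == 0.0:
        raise ValueError("axes must be non-zero")
    h = [c / h_len for c in hole_axis]
    offset = [t - e for t, e in zip(peg_tip, hole_entry)]
    depth = _dot(offset, h)  # positive once the tip passes the entry plane
    lateral = _norm([o - depth * hc for o, hc in zip(offset, h)])
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, _dot(peg_axis, hole_axis) / (p_len * h_len)))
    return {
        "tip_distance": _norm(offset),
        "lateral_error": lateral,
        "insertion_depth": depth,
        "angle_deg": math.degrees(math.acos(cos_angle)),
    }


if __name__ == "__main__":
    print(insertion_diagnostics(
        peg_tip=(0.02, 0.01, 0.0), peg_axis=(1.0, 0.0, 0.0),
        hole_entry=(0.0, 0.0, 0.0), hole_axis=(1.0, 0.0, 0.0),
    ))
```

Logging these per step would let timeouts be split by phase, for example "aligned but not inserted" versus "never reached the hole".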
For the task-wise benchmark table, see docs/TASK_BENCHMARK.md.
- Train a stable FM transformer + ViT policy on StackCube motion-planning data.
- Add PegInsertionSide close hole cameras and contact-feature conversion.
- Save multi-camera eval grids with eval.grid_cameras.
- Add task-specific failure diagnostics for insertion and contact-rich tasks: grasp state, peg-hole distance, angular alignment, insertion depth, contact force, and jamming detection.
- Improve PegInsertionSide with better camera placement, richer contact conditioning, and phase-aware evaluation metrics.
- Add domain randomization presets for all ManiSkill tasks once the baseline imitation-learning setup is stable.
- Add LeRobot v3 dataset support for scalable multi-task training across larger shared robot datasets.
- Add multi-GPU training for larger ViT backbones, more cameras, and larger task mixtures.
- Add RL fine-tuning after imitation learning. The immediate target is to warm start from mini_pi0_fm checkpoints, then optimize task success and contact robustness with environment rewards while constraining policy drift from the demonstration policy.
- Expand to more ManiSkill tasks and eventually reuse the same policy stack with the IsaacLab adapter.













