Off-Policy Generative Policy Optimization — online RL fine-tuning of flow-matching policies, with a suite of baseline algorithms, on Robomimic, D4RL/Kitchen, Adroit, Push-T, and OGBench environments.
Policies are flow-matching generators rather than simple Gaussians, enabling multimodal action distributions for manipulation. Every method follows the same pipeline — offline BC pretraining → online RL fine-tuning — and differs in how it does the RL part.
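
To make the contrast with a Gaussian policy concrete, the following is a minimal sketch of how a flow-matching policy can draw an action: start from Gaussian noise and integrate a velocity field with Euler steps. The `velocity` function is a hypothetical closed-form stand-in for a learned network, and `sample_action` and `num_steps` are illustrative names, not this repository's API. The toy field has two modes (`obs - 1` and `obs + 1`), which a single Gaussian could not represent.

```python
import numpy as np


def velocity(obs, a, t):
    """Hypothetical stand-in for a learned velocity network v(obs, a, t).

    Each noise sample is transported along a straight line to the nearer
    of two modes, obs - 1 or obs + 1 (toy example, not a trained model).
    """
    target = obs + np.sign(a - obs)
    return (target - a) / (1.0 - t)


def sample_action(obs, rng, num_steps=10):
    """Draw one action by Euler-integrating the velocity field from t=0 to t=1."""
    a = rng.standard_normal(obs.shape)  # a_0 ~ N(0, I)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so the division above is safe
        a = a + dt * velocity(obs, a, t)
    return a


rng = np.random.default_rng(0)
obs = np.zeros(1)
samples = np.array([sample_action(obs, rng) for _ in range(200)])
print("fraction at +1 mode:", float((samples > 0).mean()))
```

A trained policy replaces `velocity` with a network conditioned on the observation; the sampling loop keeps the same shape.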
# Install uv (if needed)
curl -LsSf https://astral.sh/uv/install.sh | sh
# CUDA 12
uv sync --extra cuda12 --extra robomimic
# or CUDA 13
uv sync --extra cuda13 --extra robomimic

Drop --extra robomimic if you are not running the Robomimic environments
(square, transport, tool_hang). make sync is a shortcut for the CUDA 12 line.
Note: We use robosuite version 1.5.x for all our experiments. Please download the datasets to a directory of your choice. Robomimic datasets are loaded from
~/.robomimic by default. If your downloaded datasets live elsewhere, point ROBOMIMIC_DATASET_ROOT at them (the loader expects $ROBOMIMIC_DATASET_ROOT/<task>/<dataset_type>/<file>.hdf5):

export ROBOMIMIC_DATASET_ROOT=/path/to/robomimic
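
As a quick sanity check before launching a run, the expected layout can be sketched in Python. This is a hypothetical helper, not the repository's loader: the function name, the `env` parameter, and the example values `ph` and `low_dim.hdf5` are illustrative, and it assumes the default `~/.robomimic` root uses the same `<task>/<dataset_type>/<file>.hdf5` layout.

```python
import os
from pathlib import Path


def robomimic_dataset_path(task, dataset_type, filename, env=None):
    """Hypothetical helper: build <root>/<task>/<dataset_type>/<filename>.

    The root is ROBOMIMIC_DATASET_ROOT if set, otherwise ~/.robomimic.
    """
    env = os.environ if env is None else env
    root = env.get("ROBOMIMIC_DATASET_ROOT") or "~/.robomimic"
    return Path(root).expanduser() / task / dataset_type / filename


# Illustrative values only; substitute your own task, type, and file name.
path = robomimic_dataset_path("square", "ph", "low_dim.hdf5")
print(path, "(found)" if path.exists() else "(missing)")
```

If the path prints as missing, check the directory names against the layout above before starting training.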
# OGPO on square (low-dim)
bash scripts/ogpo/square.sh
# Override any hyperparameter inline
bash scripts/ogpo/square.sh --seed=42 --online_steps=500000
# Other algorithms
bash scripts/bptt/square.sh
bash scripts/qc/square.sh
bash scripts/dsrl/square.sh
bash scripts/expo/square.sh
bash scripts/fql/square.sh

Runs log to Weights & Biases. Set WANDB_API_KEY in your environment to log
online; without it, W&B falls back to offline mode. Experiment data (per-run
flags.json and CSV logs) is written under exp/ and checkpoints under
./checkpoints/. Relocate either root with an environment variable — e.g. to
point both at fast scratch storage on a cluster:
export OGPO_EXP_ROOT=/scratch/$USER/ogpo/exp # experiment data + CSV logs
export OGPO_CHECKPOINT_ROOT=/scratch/$USER/ogpo/ckpt # checkpoints

| Script dir | Method | Idea |
|---|---|---|
| ogpo/ | OGPO | PPO-style policy gradient on SDE-fied flow policies, with advantages computed by GRPO-style mean(Q) value subtraction over a group of sampled actions. |
| bptt/ | BPTT | Differentiate Q through the ODE-sampled actions into the actor. |
| qc/ | QC / ACFQL | Action-chunked FQL; BON + SFT from Q. |
| dsrl/ | DSRL | SAC in the initial noise space of a frozen pretrained flow policy. |
| expo/ | EXPO | Learned Gaussian residuals on top of a frozen pretrained flow policy. |
| dsrl_plus_expo/ | DSRL+EXPO | DSRL noise actor + EXPO edit layer. |
| fql/ | FQL | OGPO with one-step distillation (--agent.use_one_step_policy=true). |
| ogpo_awr/, ogpo_fpo/, mipq/ | OGPO ablations | Advantage weighting (AWR, AW-OGPO, AW-OGPO-NN) / off-policy FPO/FPO++ (OFPO, OFPO++) / MIP Q variants via --agent.* flags. |

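
The OGPO entry describes group-relative advantages: for each state, several actions are sampled, each is scored by Q, and the group mean of Q is subtracted as a baseline. The NumPy sketch below illustrates that idea only. The function names, the array layout, and the use of the standard PPO clipped surrogate are assumptions made for illustration; the repository's actual loss may normalize, clip, or weight terms differently.

```python
import numpy as np


def group_advantages(q_values):
    """q_values: (batch, group) array of Q(s_i, a_ij) for G sampled actions per state.

    Returns Q minus the per-state group mean, so each row sums to zero.
    """
    baseline = q_values.mean(axis=1, keepdims=True)
    return q_values - baseline


def ppo_clip_objective(ratio, adv, eps=0.2):
    """Standard PPO clipped surrogate (to be maximized), averaged over all samples.

    ratio: new-policy / old-policy likelihood ratio, same shape as adv.
    """
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return np.minimum(unclipped, clipped).mean()


# Two states, three sampled actions each (made-up Q values).
q = np.array([[1.0, 2.0, 3.0],
              [0.5, 0.5, 2.0]])
adv = group_advantages(q)
print(adv)
```

Because the baseline is the group mean, no separate value network is needed for the advantage estimate in this sketch; the critic only has to supply Q.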
You rarely need to edit a script. Every config variable at the top of a script is overridable from the command line, so experiments compose as argument patterns:
bash scripts/ogpo/square.sh --seed=42 --actor_lr=4.5e-5 --adv_strategy=conservative \
--run_group=square_conservative_adv

See scripts/README.md for the full override pattern, the
environment list, how to call ogpo/main.py directly, and micro-tutorials for
wrapping these scripts for your own SLURM / SageMaker / container cluster.
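
Because every override is a plain `--key=value` argument, a sweep can be generated programmatically. The sketch below is one hypothetical way to do that from Python; `build_command` and `seed_sweep` are illustrative helpers that are not part of this repository, and only flags already shown above are used.

```python
import shlex


def build_command(script, **overrides):
    """Return the argv list for `bash <script> --key=value ...`."""
    args = ["bash", script]
    for key, value in overrides.items():
        args.append(f"--{key}={value}")
    return args


def seed_sweep(script, seeds, **overrides):
    """One command per seed, sharing all other overrides."""
    return [build_command(script, seed=s, **overrides) for s in seeds]


commands = seed_sweep(
    "scripts/ogpo/square.sh",
    seeds=[0, 1, 2],
    actor_lr="4.5e-5",
    run_group="square_conservative_adv",
)
for cmd in commands:
    print(shlex.join(cmd))  # shell-quoted form, e.g. for a job file
```

Each entry in `commands` can be passed to `subprocess.run(cmd, check=True)` to launch locally, or the printed lines can be pasted into a scheduler job script.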
ogpo/ core package — agents, networks, runners, configs, training entry point
envs/ environment + dataset wrappers (Robomimic, D4RL, Adroit, Push-T, OGBench)
scripts/ one runnable training script per (algorithm, environment)
Configuration defaults live in ogpo/configs/algos/ (common.yaml + per-algo
overrides); the training entry point is ogpo/main.py (ogpo-train).
Image-based runs use a frozen PaliGemma (SigLIP vision tower + Gemma language
trunk) encoder loaded from a Pi0/Pi0.5 checkpoint. Point
--checkpoint.restore_encoder_path at your local checkpoint directory (the
*_pg_libero.sh / *_image_paligemma.sh scripts show the expected layout and a
gcloud storage download command). State defaults to robot proprioception only
(--agent.use_state=proprio).
If you use this code, please cite:
@article{patil2026ogpo,
title={OGPO: Sample Efficient Full-Finetuning of Generative Control Policies},
author={Patil, Sarvesh and Nakamoto, Mitsuhiko and Agarwal, Manan and Saxena, Shashwat and Zhang, Jesse and Anantharaman, Giri and Winston, Cleah and Pan, Chaoyi and Chen, Douglas and Huang, Nai-Chieh and others},
journal={arXiv preprint arXiv:2605.03065},
year={2026}
}

License: MIT (see pyproject.toml).