OGPO

Off-Policy Generative Policy Optimization — online RL fine-tuning of flow-matching policies, with a suite of baseline algorithms, on Robomimic, D4RL/Kitchen, Adroit, Push-T, and OGBench environments.

Policies are flow-matching generators rather than simple Gaussians, which lets them represent multimodal action distributions for manipulation. Every method follows the same pipeline (offline BC pretraining → online RL fine-tuning) and differs only in how it performs the online RL stage.

Install

# Install uv (if needed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# CUDA 12
uv sync --extra cuda12 --extra robomimic

# or CUDA 13
uv sync --extra cuda13 --extra robomimic

Drop --extra robomimic if you are not running the Robomimic environments (square, transport, tool_hang). The make sync target is a shortcut for the CUDA 12 command.

Note: We use robosuite version 1.5.x for all our experiments. Please download the datasets to a directory of your choice. Robomimic datasets are loaded from ~/.robomimic by default. If your downloaded datasets live elsewhere, point ROBOMIMIC_DATASET_ROOT at them (the loader expects $ROBOMIMIC_DATASET_ROOT/<task>/<dataset_type>/<file>.hdf5):
export ROBOMIMIC_DATASET_ROOT=/path/to/robomimic
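
To make the expected layout concrete, here is a minimal Python sketch of how a path could be resolved from that variable. The helper name `robomimic_dataset_path` and the example values (`square`, `ph`, `low_dim`, `/data/robomimic`) are illustrative assumptions, not the repository's actual loader, and the sketch assumes the default `~/.robomimic` root uses the same `<task>/<dataset_type>/<file>.hdf5` layout.

```python
import os
from pathlib import Path


def robomimic_dataset_path(task, dataset_type, filename, env=None):
    """Hypothetical helper: build <root>/<task>/<dataset_type>/<file>.hdf5."""
    env = os.environ if env is None else env
    # Fall back to ~/.robomimic when ROBOMIMIC_DATASET_ROOT is unset.
    root = Path(env.get("ROBOMIMIC_DATASET_ROOT", "~/.robomimic")).expanduser()
    return root / task / dataset_type / f"{filename}.hdf5"


# Placeholder values; substitute your own task, dataset type, and file name.
path = robomimic_dataset_path(
    "square", "ph", "low_dim",
    env={"ROBOMIMIC_DATASET_ROOT": "/data/robomimic"},
)
print(path.as_posix())
```

Calling `path.is_file()` on the result before launching a run is a quick way to confirm the datasets are where the loader will look.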

Quickstart

# OGPO on square (low-dim)
bash scripts/ogpo/square.sh

# Override any hyperparameter inline
bash scripts/ogpo/square.sh --seed=42 --online_steps=500000

# Other algorithms
bash scripts/bptt/square.sh
bash scripts/qc/square.sh
bash scripts/dsrl/square.sh
bash scripts/expo/square.sh
bash scripts/fql/square.sh

Runs log to Weights & Biases. Set WANDB_API_KEY in your environment to log online; without it, W&B falls back to offline mode. Experiment data (per-run flags.json and CSV logs) is written under exp/ and checkpoints under ./checkpoints/. Relocate either root with an environment variable — e.g. to point both at fast scratch storage on a cluster:

export OGPO_EXP_ROOT=/scratch/$USER/ogpo/exp          # experiment data + CSV logs
export OGPO_CHECKPOINT_ROOT=/scratch/$USER/ogpo/ckpt  # checkpoints

Algorithms

Script dir	Method	Idea
`ogpo/`	OGPO	PPO-style policy gradient on SDE-fied flow policies, with advantages computed by GRPO-style subtraction of the mean Q-value over a group of sampled actions.
`bptt/`	BPTT	Differentiate Q through the ODE-sampled actions into the actor.
`qc/`	QC / ACFQL	Action-chunked FQL; BON + SFT from Q.
`dsrl/`	DSRL	SAC in the initial noise space of a frozen pretrained flow policy.
`expo/`	EXPO	Learned Gaussian residuals on top of a frozen pretrained flow policy.
`dsrl_plus_expo/`	DSRL+EXPO	DSRL noise actor + EXPO edit layer.
`fql/`	FQL	OGPO with one-step distillation (`--agent.use_one_step_policy=true`).
`ogpo_awr/`, `ogpo_fpo/`, `mipq/`	OGPO ablations	Advantage weighting (AWR, AW-OGPO, AW-OGPO-NN), off-policy FPO/FPO++ (OFPO, OFPO++), and MIP Q variants, selected via `--agent.*` flags.
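
The OGPO row compresses two ideas: a group-mean baseline on Q-values and a PPO-style update. The sketch below illustrates both on plain Python floats. It is a sketch under stated assumptions rather than the repository's implementation: the function names are hypothetical, the advantage uses mean subtraction only (no standard-deviation normalization), and the clipped surrogate is the standard PPO form, which the table does not spell out.

```python
from statistics import fmean


def group_mean_advantages(q_values):
    """A_i = Q(s, a_i) - mean_j Q(s, a_j) for one state's group of sampled actions."""
    baseline = fmean(q_values)
    return [q - baseline for q in q_values]


def ppo_clipped_objective(ratio, advantage, clip_eps=0.2):
    """Standard PPO clipped surrogate for one sample (assumed form, not confirmed by the README)."""
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)


# Q-values of four actions sampled from the policy at the same state.
q_group = [1.0, 2.0, 3.0, 6.0]
advantages = group_mean_advantages(q_group)
print(advantages)  # [-2.0, -1.0, 0.0, 3.0]

# A ratio above 1 + clip_eps earns no extra credit on a positive advantage.
print(ppo_clipped_objective(ratio=1.5, advantage=advantages[3]))
```

Because the baseline is the group mean, the advantages of each group sum to zero, so actions are ranked only relative to other samples at the same state.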

Composing experiments

You rarely need to edit a script. Every config variable at the top of a script is overridable from the command line, so experiments compose as argument patterns:

bash scripts/ogpo/square.sh --seed=42 --actor_lr=4.5e-5 --adv_strategy=conservative \
    --run_group=square_conservative_adv
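
Since overrides are ordinary `--key=value` arguments, a sweep can be generated programmatically. The following Python sketch only builds and prints command strings for a seed sweep; the helper `build_command` is hypothetical, the seed values are arbitrary, and the flags are the ones shown in the example above.

```python
import shlex


def build_command(script, **overrides):
    """Hypothetical helper: turn keyword overrides into --key=value arguments."""
    args = [f"--{key}={value}" for key, value in overrides.items()]
    return shlex.join(["bash", script, *args])


# One command per seed, all sharing the same run group.
commands = [
    build_command(
        "scripts/ogpo/square.sh",
        seed=seed,
        adv_strategy="conservative",
        run_group="square_conservative_adv",
    )
    for seed in (0, 1, 2)
]

for command in commands:
    print(command)
```

The printed lines can be pasted into a shell or handed to a job scheduler; nothing is executed by the sketch itself.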

See scripts/README.md for the full override pattern, the environment list, how to call ogpo/main.py directly, and micro-tutorials for wrapping these scripts for your own SLURM / SageMaker / container cluster.

Repository layout

ogpo/        core package — agents, networks, runners, configs, training entry point
envs/        environment + dataset wrappers (Robomimic, D4RL, Adroit, Push-T, OGBench)
scripts/     one runnable training script per (algorithm, environment)

Configuration defaults live in ogpo/configs/algos/ (common.yaml + per-algo overrides); the training entry point is ogpo/main.py (ogpo-train).

Image observations (frozen VL encoder)

Image-based runs use a frozen PaliGemma (SigLIP vision tower + Gemma language trunk) encoder loaded from a Pi0/Pi0.5 checkpoint. Point --checkpoint.restore_encoder_path at your local checkpoint directory (the *_pg_libero.sh / *_image_paligemma.sh scripts show the expected layout and a gcloud storage download command). State defaults to robot proprioception only (--agent.use_state=proprio).

Citation

If you use this code, please cite:

@article{patil2026ogpo,
  title={OGPO: Sample Efficient Full-Finetuning of Generative Control Policies},
  author={Patil, Sarvesh and Nakamoto, Mitsuhiko and Agarwal, Manan and Saxena, Shashwat and Zhang, Jesse and Anantharaman, Giri and Winston, Cleah and Pan, Chaoyi and Chen, Douglas and Huang, Nai-Chieh and others},
  journal={arXiv preprint arXiv:2605.03065},
  year={2026}
}

License

MIT (see pyproject.toml).
