Sashuai Zhou1,2*, Qiang Zhou2*, Jijin Hu2*, Hanqing Yang2*, Yue Cao3, Junpeng Ma4,
Yinchao Ma2, Jun Song2†, Tiezheng Ge2, Cheng Yu2, Bo Zheng2, Zhou Zhao1†
1Zhejiang University 2Alibaba Group 3Nanjing University 4Fudan University
* Equal contribution † Corresponding authors
Unified Thinker is a task-agnostic reasoning core for general image generation. It decouples a trainable Thinker (MLLM) from an image Generator (e.g., diffusion models), enabling executable planning that bridges the persistent reasoning–execution gap in reasoning-driven image generation and editing.
- Paper is now available.
- [Planned] Code / checkpoints / HieraReason-40K will be released soon. Stay tuned.
- Decoupled Thinker–Generator design: upgrade reasoning without retraining the entire generator.
- Unified planning format across T2I (creation) and I2I (edit-only modification).
- HieraReason-40K: hierarchical reasoning traces + executable enhanced prompts for cold start.
- Dual-phase RL with generator-in-the-loop to align plans with actual visual outcomes.
- Cross-generator transfer: Thinker can be plugged into different diffusion backbones.
This repository currently serves as the project homepage. The following items are planned for release:
- Training & inference code
- Model checkpoints (Thinker / Generator adapters)
- HieraReason-40K data & processing scripts
- Reproduction scripts for benchmarks
If you would like to be notified when releases happen, please watch this repo.
- Thinker (MLLM)
  - Input: instruction (+ optional reference image)
  - Output: structured reasoning trace + executable visual specification (enhanced prompt)
- Generator (Diffusion model)
  - Input: enhanced prompt/spec (+ optional reference image for editing)
  - Output: final image
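The Thinker/Generator interface described above can be sketched in a few lines of Python. This is a minimal illustration only, since the official code has not been released yet: `Plan`, `StubThinker`, `StubGenerator`, and `run_pipeline` are hypothetical names, images are represented as plain strings, and the stubs return placeholders instead of calling an MLLM or a diffusion model.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass
class Plan:
    """Hypothetical container for the Thinker's output."""
    reasoning_trace: str   # structured reasoning trace
    enhanced_prompt: str   # executable visual specification


class Thinker(Protocol):
    def plan(self, instruction: str, reference_image: Optional[str] = None) -> Plan: ...


class Generator(Protocol):
    def generate(self, enhanced_prompt: str, reference_image: Optional[str] = None) -> str: ...


class StubThinker:
    """Placeholder Thinker: a real one would query an MLLM."""

    def plan(self, instruction: str, reference_image: Optional[str] = None) -> Plan:
        # A reference image signals an edit (I2I); otherwise it is creation (T2I).
        mode = "edit" if reference_image is not None else "create"
        trace = f"[{mode}] analyse instruction: {instruction}"
        return Plan(reasoning_trace=trace, enhanced_prompt=f"{mode}: {instruction}")


class StubGenerator:
    """Placeholder Generator: a real one would run a diffusion model."""

    def generate(self, enhanced_prompt: str, reference_image: Optional[str] = None) -> str:
        source = reference_image if reference_image is not None else "noise"
        return f"image({enhanced_prompt!r} from {source})"


def run_pipeline(thinker: Thinker, generator: Generator,
                 instruction: str, reference_image: Optional[str] = None) -> str:
    # The Generator only ever sees the enhanced prompt, never the raw instruction.
    plan = thinker.plan(instruction, reference_image)
    return generator.generate(plan.enhanced_prompt, reference_image)
```

Because the two components only share the `Plan.enhanced_prompt` string, either side can be swapped independently, for example `run_pipeline(StubThinker(), StubGenerator(), "make it blue", "cat.png")` for an edit request.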
Training:
- Stage 1: Joint Supervised Fine-Tuning
  - Teach the Thinker the planning interface using HieraReason-40K
  - Align Generator to the enhanced prompts
- Stage 2: Dual-Phase Reinforcement Learning
  - Phase 2.1 (Thinker RL): select plans that yield better images under constraint-based rewards
  - Phase 2.2 (Generator RL): improve execution fidelity with stochastic rollouts + relative advantages
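To make the two RL phases more concrete, here is a rough sketch of the quantities they refer to. Everything in it is an assumption for illustration: the reward is modelled as the fraction of satisfied constraint checks, plan selection is a simple best-of-N pick, and the relative advantages use group normalisation (subtract the group mean, divide by the group standard deviation), as is common in GRPO-style training. The paper's actual reward design and update rule may differ, and all function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import Callable, List, Sequence, Tuple

Constraint = Callable[[str], bool]  # hypothetical: a check applied to a rendered image


def constraint_reward(image: str, constraints: Sequence[Constraint]) -> float:
    """Fraction of constraints the image satisfies (assumed reward form)."""
    if not constraints:
        return 0.0
    return sum(1.0 for check in constraints if check(image)) / len(constraints)


def select_best_plan(plans: Sequence[str],
                     render: Callable[[str], str],
                     constraints: Sequence[Constraint]) -> Tuple[str, List[float]]:
    """Phase 2.1 flavour: render every candidate plan and keep the best-scoring one."""
    rewards = [constraint_reward(render(plan), constraints) for plan in plans]
    best = max(range(len(plans)), key=rewards.__getitem__)
    return plans[best], rewards


def relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Phase 2.2 flavour: score each rollout relative to its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when all rollouts receive the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

In this sketch the generator is in the loop through `render`: a plan is only rewarded for what the rendered image actually satisfies, not for how the plan itself reads.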
If you find this work useful, please cite:
@misc{zhou2026unifiedthinker,
  title={Unified Thinker: A General Reasoning Modular Core for Image Generation},
  author={Sashuai Zhou and Qiang Zhou and Jijin Hu and Hanqing Yang and Yue Cao and Junpeng Ma and Yinchao Ma and Jun Song and Tiezheng Ge and Cheng Yu and Bo Zheng and Zhou Zhao},
  year={2026},
  eprint={2601.03127},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2601.03127},
}