View-R1: Asymmetric Policy Optimization for Difficulty-Aware Multimodal Reinforcement Learning

Introduction

APO, a novel framework aimed at bolstering the reasoning capabilities of multimodal large language models, is presented in this work.

Mitigates "overthinking":

The work tackles the issue of MLLMs generating excessively long and often incorrect reasoning paths by introducing Suboptimal Trajectory Complexity Regularization (STCR).
This method penalizes overly long responses for incorrect predictions, encouraging clearer and more concise reasoning.

Balances reasoning improvement with "Training stability":

The work introduces Difficulty-Adaptive Divergence Shaping (DADS) to dynamically adjust the KL divergence penalty during RL training.
This allows the model to explore more effectively on difficult reasoning tasks while preserving its original knowledge and performance on general tasks, thus improving training stability and sample utilization.

Experiment

Quick Start

Installation

git clone https://github.com/Indolent-Kawhi/View-R1
cd View-R1
uv venv viewr1 --python 3.11
source viewr1/bin/activate
uv pip install -e ".[vllm]"
uv pip install flash_attn --no-build-isolation
uv pip install flashinfer-python

Datasets format:

[
  {
    "message":"[
      {
        \"role\": \"user\",
        \"content\": [
            { \
                \"type\": \"image\",
                \"image\": \"file:///path/to/your/image.jpg\",
            }, \
            {\"type\": \"text\", \"text\": \"How many points in the photo?\"},
        ],
      }
    ]",
    "answer": "$42$"
  },
]

Training

bash examples/scripts/train_grpo.sh

Use --use_stcr and --use_dads to control whether STCR and DADS are used. This script does not shuffle the dataset by default.

Acknowledgements

Our project is based on OpenRLHF, LMM-R1 and Observe-R1.

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
assets		assets
examples/scripts		examples/scripts
openrlhf		openrlhf
.gitignore		.gitignore
README.md		README.md
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
setup.py		setup.py
version.txt		version.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

View-R1: Asymmetric Policy Optimization for Difficulty-Aware Multimodal Reinforcement Learning

Introduction

Mitigates "overthinking":

Balances reasoning improvement with "Training stability":

Experiment

Quick Start

Installation

Datasets format:

Training

Acknowledgements

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

View-R1: Asymmetric Policy Optimization for Difficulty-Aware Multimodal Reinforcement Learning

Introduction

Mitigates "overthinking":

Balances reasoning improvement with "Training stability":

Experiment

Quick Start

Installation

Datasets format:

Training

Acknowledgements

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages