Vision-R1: Evolving Human-Free Alignment in Large Vision-Language Models via Vision-Guided Reinforcement Learning

📅 2025-03-23
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Problem: Large Vision-Language Models (LVLMs) rely heavily on labor-intensive human preference annotations and task-specific reward models for alignment. Method: This paper proposes a human-annotation-free, vision-guided reinforcement learning framework. Its core innovations are: (1) the first vision-feedback-driven R1-style RL algorithm, where raw images serve as intrinsic supervisory signals; (2) a generalizable, multi-dimensional task-logic reward function that jointly measures accuracy, consistency, and instruction adherence; and (3) a dynamic progressive rule optimization mechanism that effectively mitigates reward hacking. Results: Evaluated on a 7B LVLM, the method achieves comprehensive performance gains (key metrics improve by up to 50%) and substantially outperforms state-of-the-art models with ten times its parameter count.
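The multi-dimensional reward the summary describes can be pictured with a short sketch. Everything below is illustrative rather than the paper's implementation: the detection-style task, the bracketed-box regex, the `<answer>` template check, and the 0.6/0.2/0.2 weights are all assumptions.

```python
import re

def box_iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def vision_reward(completion, gt_box, weights=(0.6, 0.2, 0.2)):
    """Blend accuracy, output consistency, and instruction adherence
    into one scalar reward (hypothetical weights and criteria)."""
    w_acc, w_fmt, w_ins = weights
    m = re.search(r"\[([\d.,\s]+)\]", completion)
    fmt = 1.0 if m else 0.0                      # consistency: output parses at all
    acc = 0.0
    if m:
        try:
            pred = [float(x) for x in m.group(1).split(",")]
        except ValueError:
            pred = []
        if len(pred) == 4:
            acc = box_iou(pred, gt_box)          # accuracy: IoU against ground truth
    ins = 1.0 if completion.strip().startswith("<answer>") else 0.0  # assumed answer template
    return w_acc * acc + w_fmt * fmt + w_ins * ins

# Example: a well-formed grounding answer close to the ground-truth box.
print(vision_reward("<answer> [10, 10, 50, 50] </answer>", [12, 8, 48, 52]))
```

Blending several criteria means a completion must satisfy the whole task logic, not just one easy dimension, to score well.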

📝 Abstract
Large Vision-Language Models (LVLMs) typically follow a two-stage training paradigm of pretraining and supervised fine-tuning. Recently, preference optimization, derived from the language domain, has emerged as an effective post-training reinforcement strategy to enhance the capabilities of LVLMs. However, constructing high-quality human-annotated preference data and developing robust reward models to mimic these preferences are both costly and challenging. Motivated by this observation, we propose Vision-R1, a novel vision-guided R1-like reinforcement learning algorithm for LVLMs that rewards models with definitive vision feedback. It leverages only curated instruction data, eliminating the need for specialized reward models and handcrafted preference datasets. We incorporate a criterion-driven reward function that integrates multi-dimensional feedback to evaluate model completions comprehensively according to the vision task logic. Furthermore, we introduce a progressive rule refinement strategy that dynamically adjusts the reward criteria during training, enabling continuous model improvement and mitigating reward hacking. Extensive experiments on both in-distribution and out-of-distribution benchmarks demonstrate that fine-tuning 7B LVLMs with Vision-R1 achieves consistent performance gains, improving by up to 50% and surpassing a state-of-the-art model ten times its size.
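The progressive rule refinement the abstract mentions admits a very simple reading: a reward criterion whose acceptance bar tightens over training, so the policy cannot keep farming reward from loosely correct outputs. The linear schedule and the 0.5 to 0.85 endpoints below are assumptions for illustration, not values from the paper.

```python
def progressive_threshold(step, total_steps, start=0.5, end=0.85):
    """Linearly raise the acceptance bar as training progresses.
    Loose matches earn reward early so learning can get off the ground;
    later, only tight matches count, which blunts reward hacking."""
    frac = min(1.0, max(0.0, step / max(1, total_steps)))
    return start + (end - start) * frac

def gated_reward(raw_score, step, total_steps):
    """Zero out scores that no longer clear the current, stricter bar."""
    return raw_score if raw_score >= progressive_threshold(step, total_steps) else 0.0

# The same 0.7-scoring completion passes early in training but not late.
print(gated_reward(0.7, step=100, total_steps=1000))  # 0.7 (bar is 0.535)
print(gated_reward(0.7, step=900, total_steps=1000))  # 0.0 (bar is 0.815)
```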
Problem

Research questions and friction points this paper is trying to address.

How to eliminate the need for human-annotated preference data
How to design vision-guided reinforcement learning for LVLMs
How to improve model performance without specialized reward models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-guided R1-like reinforcement learning algorithm (see the sketch after this list)
Criterion-driven reward function with multi-dimensional feedback
Progressive rule refinement strategy that dynamically adjusts reward criteria during training
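For concreteness, the R1-like loop in the first bullet typically samples several completions per image, scores each with the vision reward, and converts the scores into group-relative advantages, as in GRPO. The sketch below shows only that advantage step; it is a generic reading of the R1 recipe under these assumptions, not Vision-R1's exact update rule.

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantages: score each sampled completion relative to
    its group, advantage_i = (r_i - mean(group)) / std(group)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Four completions sampled for one image, scored by a vision reward:
# above-average answers get positive advantages, weak ones negative.
print(group_relative_advantages([0.9, 0.4, 0.7, 0.1]))
```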
👥 Authors
Yufei Zhan
Institute of Automation, Chinese Academy of Sciences
Computer Vision · Large Multimodal Models · Grounding and Detection
Yousong Zhu
Associate Professor, Chinese Academy of Sciences, Institute of Automation
Multimodal Large Language Models · Self-supervised Learning · Object Detection
Shurong Zheng
Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China; Peng Cheng Laboratory, Shenzhen, China
Hongyin Zhao
Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China
Fan Yang
Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China; Peng Cheng Laboratory, Shenzhen, China
Ming Tang
Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
Jinqiao Wang
Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China; Peng Cheng Laboratory, Shenzhen, China; Wuhan AI Research, Wuhan, China