Generative RLHF-V: Learning Principles from Multi-modal Human Preference

📅 2025-05-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Traditional scalar reward models for aligning multimodal large language models (MLLMs) suffer from low accuracy, poor generalization, and limited interpretability, hindering the effectiveness of RLHF and related methods. To address these limitations, we propose the first framework integrating generative reward modeling with multimodal RLHF, introducing a novel two-stage paradigm: (1) reinforcement learning–driven multimodal generative reward modeling to enhance preference understanding; and (2) RL optimization based on grouped response comparison, yielding learnable, interpretable, and strongly generalizable reward signals. Our approach overcomes the pairwise discrimination constraint and scales nearly linearly with increasing numbers of candidate responses. Evaluated across seven benchmarks, it improves the average performance of four mainstream MLLMs by 18.1%—significantly surpassing baseline RLHF (5.3%)—while substantially enhancing out-of-distribution generalization and scoring accuracy.

📝 Abstract
Training multi-modal large language models (MLLMs) that align with human intentions is a long-term challenge. Traditional score-only reward models for alignment suffer from low accuracy, weak generalization, and poor interpretability, blocking the progress of alignment methods, e.g., reinforcement learning from human feedback (RLHF). Generative reward models (GRMs) leverage MLLMs' intrinsic reasoning capabilities to discriminate pair-wise responses, but their pair-wise paradigm makes it hard to generalize them to learnable rewards. We introduce Generative RLHF-V, a novel alignment framework that integrates GRMs with multi-modal RLHF. We propose a two-stage pipeline: (1) multi-modal generative reward modeling from RL, where RL guides GRMs to actively capture human intention and then predict the correct pair-wise scores; and (2) RL optimization from grouped comparison, which improves multi-modal RL scoring precision by comparing grouped responses. Experimental results demonstrate that, beyond out-of-distribution generalization of RM discrimination, our framework improves 4 MLLMs' performance across 7 benchmarks by 18.1%, while the baseline RLHF achieves only 5.3%. We further validate that Generative RLHF-V achieves a near-linear improvement with an increasing number of candidate responses. Our code and models can be found at https://generative-rlhf-v.github.io.
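The first stage described above trains the GRM itself with RL: the model reasons about a preference pair, emits pair-wise scores, and is rewarded when its predicted preference matches the human label. A minimal sketch of such a reward function, where the score-line format and the function name are illustrative assumptions rather than the paper's actual implementation:

```python
import re

def grm_reward(grm_output: str, human_prefers_a: bool) -> float:
    """RL reward for the GRM policy: 1.0 if its predicted pair-wise
    scores agree with the human preference label, else 0.0.
    Assumes (hypothetically) that the GRM ends its reasoning with a
    line like 'Scores: A=7, B=4'."""
    m = re.search(r"A\s*=\s*(-?\d+(?:\.\d+)?).*?B\s*=\s*(-?\d+(?:\.\d+)?)",
                  grm_output)
    if m is None:
        return 0.0  # unparseable output earns no reward
    score_a, score_b = float(m.group(1)), float(m.group(2))
    predicted_prefers_a = score_a > score_b
    return 1.0 if predicted_prefers_a == human_prefers_a else 0.0
```

A binary agreement reward like this lets standard RL drive the GRM toward reasoning that ends in correct pair-wise scores, rather than supervising the scores directly.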
Problem

Research questions and friction points this paper is trying to address.

Improving alignment of multi-modal LLMs with human intentions
Overcoming low accuracy in traditional reward models
Enhancing generalization and interpretability of RLHF methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates GRMs with multi-modal RLHF
Uses RL to guide GRMs to actively capture human intention
Enhances scoring precision via grouped response comparison
👥 Authors

Jiayi Zhou
Peking University

Jiaming Ji
Peking University

Boyuan Chen
Peking University

Jiapeng Sun
University College London

Wenqi Chen
Peking University

Donghai Hong
Peking University
AI Safety · AI Alignment · Multi-Modal Model

Sirui Han
The Hong Kong University of Science and Technology
Large Language Model · Interdisciplinary Artificial Intelligence

Yike Guo
The Hong Kong University of Science and Technology

Yaodong Yang
Peking University