RoVRM: A Robust Visual Reward Model Optimized via Auxiliary Textual Preference Data

📅 2024-08-22
🏛️ arXiv.org
📈 Citations: 2 · Influential: 0
🤖 AI Summary
Large vision-language models (LVLMs) face significant challenges in aligning with human preferences due to the scarcity of high-quality visual preference data. Method: This paper proposes the Robust Visual Reward Model (RoVRM), which introduces a three-phase progressive training framework and an optimal transport-based cross-modal preference data selection mechanism, enabling the effective transfer of auxiliary textual preference data to visual reward modeling. Results: Experiments with LLaVA-1.5-7B and -13B demonstrate that RoVRM consistently outperforms conventional visual reward models. Moreover, the progressive training and data selection techniques yield further, more stable gains when combined with ranking-based alignment methods such as direct preference optimization, improving response quality and factual consistency on vision-language tasks.
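To make the optimal transport-based selection step concrete, below is a minimal sketch of how auxiliary textual preference data could be ranked by its Sinkhorn (entropy-regularized optimal transport) distance to the target visual-preference distribution in embedding space. This is an illustration under assumed inputs (precomputed embeddings, uniform marginals), not the paper's implementation; all function names are hypothetical.

```python
import numpy as np

def sinkhorn_distance(X, Y, reg=0.1, n_iters=200):
    """Entropy-regularized OT (Sinkhorn) cost between two point clouds.

    X: (n, d) embeddings of a candidate textual-preference batch.
    Y: (m, d) embeddings of the target visual-preference data.
    """
    n, m = X.shape[0], Y.shape[0]
    # Pairwise squared-Euclidean cost matrix.
    C = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2 * X @ Y.T
    C = np.maximum(C, 0.0)
    K = np.exp(-C / reg)            # Gibbs kernel
    a = np.full(n, 1.0 / n)         # uniform source marginal
    b = np.full(m, 1.0 / m)         # uniform target marginal
    u = np.ones(n)
    for _ in range(n_iters):        # Sinkhorn fixed-point updates
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :] # transport plan
    return float(np.sum(P * C))     # transport cost

def select_textual_preference_batches(text_embs, target_embs, batches, k):
    """Rank candidate batches (lists of row indices into text_embs) by OT
    distance to the target distribution and keep the k closest ones."""
    scores = [sinkhorn_distance(text_embs[idx], target_embs) for idx in batches]
    order = np.argsort(scores)
    return [batches[i] for i in order[:k]]
```

Under this reading, the batches with the smallest transport cost are the textual preference examples most distributionally similar to the visual task, and would be the ones carried into the auxiliary training phases.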

📝 Abstract
Large vision-language models (LVLMs) often fail to align with human preferences, leading to issues like generating misleading content without proper visual context (also known as hallucination). A promising solution to this problem is using human-preference alignment techniques, such as best-of-n sampling and reinforcement learning. However, these techniques face the difficulty arising from the scarcity of visual preference data, which is required to train a visual reward model (VRM). In this work, we continue the line of research. We present a Robust Visual Reward Model (RoVRM) which improves human-preference alignment for LVLMs. RoVRM leverages auxiliary textual preference data through a three-phase progressive training and optimal transport-based preference data selection to effectively mitigate the scarcity of visual preference data. We experiment with RoVRM on the commonly used vision-language tasks based on the LLaVA-1.5-7B and -13B models. Experimental results demonstrate that RoVRM consistently outperforms traditional VRMs. Furthermore, our three-phase progressive training and preference data selection approaches can yield consistent performance gains over ranking-based alignment techniques, such as direct preference optimization.
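As one example of how a reward model such as RoVRM is consumed at inference time, the abstract mentions best-of-n sampling. The sketch below is a minimal, framework-agnostic illustration; the generate and score callables are hypothetical stand-ins for an LVLM sampler and a trained visual reward model, not an API from the paper.

```python
from typing import Callable, List

def best_of_n(
    prompt: str,
    image,                          # raw image passed through to the LVLM
    generate: Callable[..., str],   # hypothetical: samples one LVLM response
    score: Callable[..., float],    # hypothetical: reward for a response
    n: int = 8,
) -> str:
    """Sample n candidate responses and return the one the reward model prefers."""
    candidates: List[str] = [generate(prompt, image) for _ in range(n)]
    rewards = [score(prompt, image, c) for c in candidates]
    return candidates[max(range(n), key=lambda i: rewards[i])]
```

A more accurate reward model directly improves this selection step, which is one reason reward-model robustness matters even without reinforcement learning.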
Problem

Research questions and friction points this paper is trying to address.

Large Vision-Language Models
Misleading Content Generation (Hallucination)
Scarcity of Visual Preference Data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Robust Visual Reward Model
Visual Language Model Improvement
Preference Alignment
👥 Authors
Chenglong Wang (School of Computer Science and Engineering, Northeastern University, Shenyang, China)
Yang Gan (School of Computer Science and Engineering, Northeastern University, Shenyang, China)
Yifu Huo (Northeastern University)
Yongyu Mu (Northeastern University; interests: multilingualism, machine translation, efficient models)
Murun Yang (School of Computer Science and Engineering, Northeastern University, Shenyang, China)
Qiaozhi He (ByteDance; interests: LLMs, natural language processing)
Tong Xiao (School of Computer Science and Engineering, Northeastern University, Shenyang, China; NiuTrans Research, Shenyang, China)
Chunliang Zhang (School of Computer Science and Engineering, Northeastern University, Shenyang, China; NiuTrans Research, Shenyang, China)
Tongran Liu (CAS Key Laboratory of Behavioral Science, Institute of Psychology, CAS, Beijing, China)
Quan Du (NiuTrans Research, Shenyang, China)
Di Yang (NiuTrans Research, Shenyang, China)
Jingbo Zhu (Northeastern University, China; interests: machine translation, language parsing, natural language processing)