🤖 AI Summary
Aligning large vision-language models (LVLMs) with human preferences is hindered by the scarcity of high-quality visual preference data.
Method: This paper proposes the Robust Visual Reward Model (RoVRM), which leverages auxiliary textual preference data through a three-phase progressive training framework and an optimal-transport-based preference data selection mechanism, enabling effective transfer of textual preferences to visual reward modeling.
Results: Experiments on LLaVA-1.5-7B and -13B show that RoVRM consistently outperforms traditional visual reward models. Moreover, the three-phase progressive training and preference data selection also yield consistent gains when applied to ranking-based alignment techniques such as direct preference optimization.
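The selection mechanism above can be illustrated with a minimal sketch. This is not the paper's exact formulation; it assumes textual and visual preference samples are represented as embeddings, casts optimal transport as a one-to-one assignment (solved with SciPy's Hungarian-algorithm implementation), and keeps the textual samples whose transport cost to the visual set is lowest. The function name `ot_select` and the squared-Euclidean cost are illustrative choices.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def ot_select(text_emb, vis_emb, k):
    """Select k textual preference samples closest (in transport cost)
    to the visual preference set.

    text_emb: (T, d) embeddings of textual preference samples, T >= V.
    vis_emb:  (V, d) embeddings of visual preference samples, k <= V.
    Returns indices into text_emb of the k cheapest-matched samples.
    """
    # Pairwise squared-Euclidean cost between textual and visual embeddings.
    cost = ((text_emb[:, None, :] - vis_emb[None, :, :]) ** 2).sum(-1)
    # One-to-one minimum-cost matching (a discrete optimal-transport plan
    # with uniform unit masses).
    rows, cols = linear_sum_assignment(cost)
    matched_cost = cost[rows, cols]
    # Keep the textual samples with the cheapest matches first.
    order = rows[np.argsort(matched_cost)]
    return order[:k]
```

For example, with textual embeddings at 0, 1, 10, 11 and visual embeddings at 0.1 and 10.1 (in 1-D), the selection picks indices 0 and 2, the textual samples nearest the visual distribution.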
📝 Abstract
Large vision-language models (LVLMs) often fail to align with human preferences, leading to issues such as generating misleading content without proper visual context (also known as hallucination). A promising solution to this problem is applying human-preference alignment techniques, such as best-of-n sampling and reinforcement learning. However, these techniques are hindered by the scarcity of visual preference data, which is required to train a visual reward model (VRM). In this work, we continue this line of research and present the Robust Visual Reward Model (RoVRM), which improves human-preference alignment for LVLMs. RoVRM leverages auxiliary textual preference data through three-phase progressive training and optimal transport-based preference data selection, effectively mitigating the scarcity of visual preference data. We evaluate RoVRM on commonly used vision-language tasks with the LLaVA-1.5-7B and -13B models. Experimental results demonstrate that RoVRM consistently outperforms traditional VRMs. Furthermore, our three-phase progressive training and preference data selection approaches yield consistent performance gains when combined with ranking-based alignment techniques, such as direct preference optimization.
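Best-of-n sampling, one of the alignment techniques named above, is simple enough to sketch: the LVLM proposes n candidate responses, the reward model scores each, and the highest-scoring candidate is returned. The sketch below uses stub `generate` and `reward` callables (both illustrative placeholders, not the paper's models); here the stub reward just prefers longer captions.

```python
def best_of_n(generate, reward, prompt, n=8):
    """Best-of-n sampling: draw n candidates, return the highest-reward one.

    generate(prompt) -> str        # one candidate response from the LVLM
    reward(prompt, response) -> float  # scalar score from the reward model
    """
    candidates = [generate(prompt) for _ in range(n)]
    scores = [reward(prompt, c) for c in candidates]
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best], scores[best]

# Toy usage with stubs: cycle through a fixed candidate pool and score by length.
pool = ["blurry caption", "accurate caption", "hallucinated caption"]
calls = iter(pool)
resp, score = best_of_n(
    generate=lambda p: next(calls),
    reward=lambda p, r: float(len(r)),  # stub reward: longer = better
    prompt="Describe the image.",
    n=3,
)
# resp == "hallucinated caption" (longest string under the stub reward)
```

In practice the reward model would be RoVRM itself, scoring candidates for preference alignment rather than length; the stub only shows the control flow.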