🤖 AI Summary
This work addresses two critical bottlenecks in training vision-language reward models (VL-RMs): (1) the bootstrapping dilemma, in which generating high-quality preference data depends on already-strong vision-language models, and (2) modality bias and negative-example amplification caused by multimodal hallucinations. We propose the first iterative training framework integrating vision-expert guidance, chain-of-thought (CoT)-structured critique, and margin-based rejection sampling, breaking the bootstrapping loop while explicitly suppressing visual-attribute hallucinations and modality bias. Through multiple rounds of preference-data refinement, our method significantly improves VL-RM performance: +12.3% in hallucination detection accuracy and +9.8% in multimodal reasoning consistency. Evaluated on mainstream VL-RM benchmarks, it advances reinforcement-learning-driven alignment of VL models toward practical deployment.
📝 Abstract
Reinforcement Fine-Tuning (RFT) with verifiable rewards has advanced large language models but remains underexplored for Vision-Language (VL) models. The Vision-Language Reward Model (VL-RM) is key to aligning VL models by providing structured feedback, yet training effective VL-RMs faces two major challenges. First, a bootstrapping dilemma arises because high-quality training data depends on already-strong VL models, creating a cycle in which self-generated supervision reinforces existing biases. Second, modality bias and negative-example amplification occur when VL models hallucinate incorrect visual attributes, producing flawed preference data that further misguides training. To address these issues, we propose an iterative training framework leveraging vision experts, Chain-of-Thought (CoT) rationales, and margin-based rejection sampling. Our approach refines preference datasets, enhances structured critiques, and iteratively improves reasoning. Experiments across VL-RM benchmarks demonstrate superior performance in hallucination detection and multimodal reasoning, advancing VL model alignment with reinforcement learning.
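The margin-based rejection sampling step mentioned above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the `Candidate` type, reward scores, and margin threshold are all assumptions. The idea is to form a (chosen, rejected) preference pair only when the reward gap between candidates is large enough to be a reliable training signal, discarding ambiguous pairs that could amplify noise from hallucinated critiques.

```python
# Hypothetical sketch of margin-based rejection sampling for preference data.
# Names, scores, and the margin value are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Candidate:
    response: str
    reward: float  # score assigned by the current reward model


def margin_rejection_sample(candidates, margin=0.3):
    """Return a (chosen, rejected) pair if the reward gap between the
    best and worst candidate exceeds `margin`; otherwise return None,
    rejecting the sample as too ambiguous to supervise on."""
    ranked = sorted(candidates, key=lambda c: c.reward, reverse=True)
    best, worst = ranked[0], ranked[-1]
    if best.reward - worst.reward >= margin:
        return best, worst
    return None  # ambiguous pair: filtered out of the preference dataset


# Usage: a clear reward margin yields a training pair; a narrow one is dropped.
clear = [Candidate("grounded critique", 0.9), Candidate("hallucinated critique", 0.1)]
narrow = [Candidate("critique a", 0.55), Candidate("critique b", 0.50)]
print(margin_rejection_sample(clear))   # kept: (grounded, hallucinated)
print(margin_rejection_sample(narrow))  # None
```

Filtering by margin trades dataset size for label reliability, which is what lets later training rounds refine rather than reinforce the model's existing biases.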