🤖 AI Summary
This work addresses two critical bottlenecks in training vision-language reward models (VL-RMs): (1) the bootstrapping dilemma, in which generating high-quality preference data depends on already-strong vision-language models, and (2) modality bias and negative-example amplification caused by multimodal hallucinations. We propose the first iterative training framework integrating vision-expert guidance, chain-of-thought (CoT)-structured critique, and margin-based rejection sampling, breaking the bootstrapping loop while explicitly suppressing visual-attribute hallucinations and modality bias. Through multiple rounds of preference-data refinement, our method significantly improves VL-RM performance: +12.3% in hallucination detection accuracy and +9.8% in multimodal reasoning consistency. Evaluated on mainstream VL-RM benchmarks, it advances reinforcement-learning-driven alignment of VL models toward practical deployment.
📝 Abstract
Reinforcement Fine-Tuning (RFT) with verifiable rewards has advanced large language models but remains underexplored for Vision-Language (VL) models. The Vision-Language Reward Model (VL-RM) is key to aligning VL models by providing structured feedback, yet training effective VL-RMs faces two major challenges. First, a bootstrapping dilemma arises because high-quality training data depends on already-strong VL models, creating a cycle in which self-generated supervision reinforces existing biases. Second, modality bias and negative-example amplification occur when VL models hallucinate incorrect visual attributes, producing flawed preference data that further misguides training. To address these issues, we propose an iterative training framework leveraging vision experts, Chain-of-Thought (CoT) rationales, and margin-based rejection sampling. Our approach refines preference datasets, enhances structured critiques, and iteratively improves reasoning. Experiments across VL-RM benchmarks demonstrate superior performance in hallucination detection and multimodal reasoning, advancing VL model alignment with reinforcement learning.
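The margin-based rejection sampling step mentioned above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the `Candidate` type, reward scores, and margin threshold are all assumptions. The idea is to form a (chosen, rejected) preference pair only when the reward gap between candidates is large enough to be a reliable training signal, discarding ambiguous pairs that could amplify noise from hallucinated critiques.

```python
# Hypothetical sketch of margin-based rejection sampling for preference data.
# Names, scores, and the margin value are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Candidate:
    response: str
    reward: float  # score assigned by the current reward model


def margin_rejection_sample(candidates, margin=0.3):
    """Return a (chosen, rejected) pair if the reward gap between the
    best and worst candidate exceeds `margin`; otherwise return None,
    rejecting the sample as too ambiguous to supervise on."""
    ranked = sorted(candidates, key=lambda c: c.reward, reverse=True)
    best, worst = ranked[0], ranked[-1]
    if best.reward - worst.reward >= margin:
        return best, worst
    return None  # ambiguous pair: filtered out of the preference dataset


# Usage: a clear reward margin yields a training pair; a narrow one is dropped.
clear = [Candidate("grounded critique", 0.9), Candidate("hallucinated critique", 0.1)]
narrow = [Candidate("critique a", 0.55), Candidate("critique b", 0.50)]
print(margin_rejection_sample(clear))   # kept: (grounded, hallucinated)
print(margin_rejection_sample(narrow))  # None
```

Filtering by margin trades dataset size for label reliability, which is what lets later training rounds refine rather than reinforce the model's existing biases.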