AI Summary
This work addresses the hallucination problem in large vision-language models, which often stems from a mismatch between their generative and discriminative capabilities. To mitigate this issue, the authors propose OSCAR, a novel framework that leverages the model's internal discrepancy between generation and discrimination to construct high-quality preference data via Monte Carlo tree search and a dual-granularity reward mechanism. Integrated with direct preference optimization, OSCAR enables online self-calibration without external supervision. The method achieves state-of-the-art performance across multiple hallucination evaluation benchmarks while significantly enhancing the model's multimodal understanding and generalization in both comprehension and generation tasks.
Abstract
Large Vision-Language Models (LVLMs) often suffer from hallucinations, generating descriptions that include visual details absent from the input image. Recent preference alignment methods typically rely on supervision distilled from stronger models such as GPT. However, this offline paradigm introduces a Supervision-Perception Mismatch: the student model is forced to align with fine-grained details beyond its perceptual capacity, learning to guess rather than to see. To obtain reliable self-supervision for online learning, we identify a Generative-Discriminative Gap within LVLMs, where models exhibit higher accuracy on discriminative verification than on open-ended generation. Leveraging this gap, we propose Online Self-CAlibRation (OSCAR), a framework that integrates Monte Carlo Tree Search with a Dual-Granularity Reward Mechanism to construct preference data and iteratively refines the model via Direct Preference Optimization. Extensive experiments demonstrate that OSCAR achieves state-of-the-art performance on hallucination benchmarks while improving general multimodal capabilities.
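For readers unfamiliar with Direct Preference Optimization, the following is a minimal sketch of the standard DPO loss on a single preference pair. It is not OSCAR's implementation: the function name `dpo_loss`, the default `beta=0.1`, and the convention that each argument is a summed token log-probability of a full response are all assumptions made for illustration. How OSCAR selects the chosen and rejected responses (tree search and dual-granularity rewards) is not shown here.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is assumed to be the summed log-probability of a full
    response under the trainable policy or the frozen reference model.
    beta (hypothetical default) scales the implicit reward.
    """
    # Implicit rewards: how much the policy has moved away from the reference.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)

    # -log(sigmoid(x)) computed in a numerically stable way.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


if __name__ == "__main__":
    # Policy identical to the reference: the loss equals log(2).
    print(dpo_loss(-2.0, -2.0, -2.0, -2.0))
    # Policy favors the chosen response more than the reference does: lower loss.
    print(dpo_loss(-1.0, -5.0, -2.0, -2.0))
```

The loss falls as the policy raises the likelihood of the preferred response relative to the rejected one, measured against the reference model. In an online setting such as the one the abstract describes, the pairs would be regenerated from the current model between refinement rounds.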