Online Self-Calibration Against Hallucination in Vision-Language Models

📅 2026-04-30
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses the hallucination problem in large vision-language models, which often stems from a mismatch between their generative and discriminative capabilities. To mitigate this issue, the authors propose OSCAR, a novel framework that leverages the model's internal discrepancy between generation and discrimination to construct high-quality preference data via Monte Carlo tree search and a dual-granularity reward mechanism. Integrated with direct preference optimization, OSCAR enables online self-calibration without external supervision. The method achieves state-of-the-art performance across multiple hallucination evaluation benchmarks while significantly enhancing the model's multimodal understanding and generalization in both comprehension and generation tasks.
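As a rough illustration of how a model's own verification scores could be turned into preference data, the sketch below ranks candidate captions by a blended reward and keeps the best and worst as a (chosen, rejected) pair. Everything here is an assumption for illustration: the `Candidate` type, the per-sentence scores (imagined as the model's probability of answering "yes" when asked whether a sentence is supported by the image), the mean/min blend standing in for a two-granularity reward, and the `min_margin` filter. The Monte Carlo tree search step is replaced by a plain ranking over already-sampled candidates, so this is not the authors' procedure.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Candidate:
    """A sampled response plus hypothetical self-verification scores."""
    text: str
    # Assumed: one score in [0, 1] per sentence, from discriminative verification.
    sentence_scores: List[float]


def response_reward(c: Candidate, alpha: float = 0.5) -> float:
    """Blend a coarse score (mean over sentences) with a fine one (worst sentence).

    This blend is an illustrative stand-in, not the paper's reward formula.
    """
    if not c.sentence_scores:
        return 0.0
    mean_score = sum(c.sentence_scores) / len(c.sentence_scores)
    worst_score = min(c.sentence_scores)
    return alpha * mean_score + (1.0 - alpha) * worst_score


def build_preference_pair(
    candidates: List[Candidate], min_margin: float = 0.1
) -> Optional[Tuple[str, str]]:
    """Return (chosen, rejected) texts, or None if the rewards are too close."""
    if len(candidates) < 2:
        return None
    ranked = sorted(candidates, key=response_reward, reverse=True)
    best, worst = ranked[0], ranked[-1]
    if response_reward(best) - response_reward(worst) < min_margin:
        return None  # skip ambiguous pairs rather than train on noise
    return best.text, worst.text
```

For example, a caption whose sentences all verify well would be chosen over one containing a single poorly verified sentence, and near-ties are dropped instead of being forced into a pair.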
πŸ“ Abstract
Large Vision-Language Models (LVLMs) often suffer from hallucinations, generating descriptions that include visual details absent from the input image. Recent preference alignment methods typically rely on supervision distilled from stronger models such as GPT. However, this offline paradigm introduces a Supervision-Perception Mismatch: the student model is forced to align with fine-grained details beyond its perceptual capacity, learning to guess rather than to see. To obtain reliable self-supervision for online learning, we identify a Generative-Discriminative Gap within LVLMs, where models exhibit higher accuracy on discriminative verification than open-ended generation. Leveraging this capability, we propose \textbf{O}nline \textbf{S}elf-\textbf{CA}lib\textbf{R}ation (OSCAR), a framework that integrates Monte Carlo Tree Search with a Dual-Granularity Reward Mechanism to construct preference data and iteratively refines the model via Direct Preference Optimization. Extensive experiments demonstrate that OSCAR achieves state-of-the-art performance on hallucination benchmarks while improving general multimodal capabilities.
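The abstract names Direct Preference Optimization as the refinement step. For readers unfamiliar with it, here is a minimal sketch of the standard DPO loss for a single preference pair, written with the Python standard library only. The function name, argument names, and the default `beta` are illustrative assumptions; the inputs are summed log-probabilities of the chosen and rejected responses under the trainable policy and a frozen reference model. Nothing below reflects OSCAR-specific settings.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed default; the source gives no value
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    loss = -log(sigmoid(beta * margin)), where margin is the difference
    between the policy-vs-reference log-ratios of the two responses.
    """
    chosen_log_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_log_ratio = policy_rejected_logp - ref_rejected_logp
    margin = chosen_log_ratio - rejected_log_ratio
    # -log(sigmoid(x)) equals softplus(-x); compute it in a numerically stable way.
    z = -beta * margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))
```

When the policy and reference agree, the margin is zero and the loss equals log 2; it falls below that as the policy shifts probability toward the chosen response relative to the reference.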
Problem

Research questions and friction points this paper is trying to address.

hallucination
vision-language models
preference alignment
supervision-perception mismatch
self-supervision
Innovation

Methods, ideas, or system contributions that make the work stand out.

Online Self-Calibration
Generative-Discriminative Gap
Direct Preference Optimization
Monte Carlo Tree Search
Vision-Language Models
Minghui Chen
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
Chenxu Yang
Institute of Information Engineering, Chinese Academy of Sciences
NLP, Dialogue Generation
Hengjie Zhu
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
Dayan Wu
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
Zheng Lin
Institute of Information Engineering, CAS
NLP
Qingyi Si
JD.COM