Online Self-Calibration Against Hallucination in Vision-Language Models

📅 2026-04-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses hallucination in large vision-language models. The authors argue that preference alignment on supervision distilled from stronger models pushes a model toward details it cannot perceive, and they propose OSCAR, a framework that instead exploits the model’s own gap between generation and discrimination: the model verifies visual content more accurately than it generates it. OSCAR uses that gap to construct preference data via Monte Carlo tree search and a dual-granularity reward mechanism, then refines the model iteratively with direct preference optimization, so that self-calibration happens online without supervision from a stronger external model. The authors report state-of-the-art results on hallucination benchmarks together with improved general multimodal capabilities.
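As a rough illustration of how discriminative verification could be turned into preference data, the sketch below scores each candidate caption by the fraction of its claims that a yes/no verifier accepts, then pairs the best and worst candidates. The `verify` callback, the per-caption claim lists, and the `min_margin` threshold are hypothetical stand-ins introduced for this example. The summary does not specify OSCAR's tree search or its dual-granularity reward, so neither is modeled here.

```python
from typing import Callable, Dict, List, Optional, Tuple


def build_preference_pair(
    candidates: Dict[str, List[str]],
    verify: Callable[[str], bool],
    min_margin: float = 0.2,
) -> Optional[Tuple[str, str]]:
    """Return (chosen, rejected) captions, or None if scores are too close.

    candidates maps each caption to the visual claims it makes.
    verify is a hypothetical yes/no check of one claim against the image.
    """
    scores: Dict[str, float] = {}
    for caption, claims in candidates.items():
        if not claims:
            continue  # nothing to verify, so skip this caption
        accepted = sum(1 for claim in claims if verify(claim))
        scores[caption] = accepted / len(claims)
    if len(scores) < 2:
        return None
    chosen = max(scores, key=scores.get)
    rejected = min(scores, key=scores.get)
    if scores[chosen] - scores[rejected] < min_margin:
        return None  # too little signal to form a useful pair
    return chosen, rejected


# Toy verifier: pretend the image contains only a dog and a ball.
present = {"dog", "ball"}
pair = build_preference_pair(
    {
        "A dog plays with a ball.": ["dog", "ball"],
        "A dog and a cat play with a ball.": ["dog", "cat", "ball"],
    },
    verify=lambda claim: claim in present,
)
print(pair)
```

In this toy case the caption that mentions a cat has one rejected claim, so it becomes the rejected side of the pair. In a real setting the verifier would be the vision-language model itself answering a yes/no question about the image.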

📝 Abstract

Large Vision-Language Models (LVLMs) often suffer from hallucinations, generating descriptions that include visual details absent from the input image. Recent preference alignment methods typically rely on supervision distilled from stronger models such as GPT. However, this offline paradigm introduces a Supervision-Perception Mismatch: the student model is forced to align with fine-grained details beyond its perceptual capacity, learning to guess rather than to see. To obtain reliable self-supervision for online learning, we identify a Generative-Discriminative Gap within LVLMs, where models exhibit higher accuracy on discriminative verification than open-ended generation. Leveraging this capability, we propose Online Self-CAlibRation (OSCAR), a framework that integrates Monte Carlo Tree Search with a Dual-Granularity Reward Mechanism to construct preference data and iteratively refines the model via Direct Preference Optimization. Extensive experiments demonstrate that OSCAR achieves state-of-the-art performance on hallucination benchmarks while improving general multimodal capabilities.
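For readers unfamiliar with Direct Preference Optimization, the following is a minimal sketch of the standard DPO loss for a single preference pair, computed from sequence log-probabilities under the policy and a frozen reference model. The function name, argument names, and the default `beta` are illustrative assumptions. The abstract does not state OSCAR's hyperparameters or any modification to the loss.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value, not taken from the paper
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair."""
    # Log-ratio of policy to reference for each response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(logits)), written as softplus(-logits) for stability.
    if logits > 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# When the policy matches the reference, the loss equals log(2).
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
# The loss drops once the policy favors the chosen response more than
# the reference does.
print(dpo_loss(-5.0, -12.0, -8.0, -8.0))
```

The loss falls below log(2) only when the policy raises the chosen response's likelihood relative to the rejected one by more than the reference model does, which is what drives each refinement round.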

Problem

Research questions and friction points this paper is trying to address.

hallucination

vision-language models

preference alignment

supervision-perception mismatch

self-supervision

Innovation

Methods, ideas, or system contributions that make the work stand out.

Online Self-Calibration

Generative-Discriminative Gap

Direct Preference Optimization