🤖 AI Summary
This work addresses hallucinated, visually ungrounded responses in large vision-language models (LVLMs), which often arise from overreliance on linguistic priors and go undetected by existing confidence estimation methods. The authors propose BICR (Blind-Image Contrastive Ranking), a framework that makes the influence of the visual input on a prediction explicit during training. BICR extracts hidden states from a frozen LVLM twice, once with the original image and once with the image blacked out, and trains a lightweight probe on the real-image hidden state with a ranking loss that penalizes higher confidence on the blacked-out view, so that visual grounding serves as a confidence signal. The approach is model-agnostic and adds no inference cost. In experiments on five LVLMs against seven baselines, BICR achieves the best cross-LVLM average on both calibration and discrimination, with statistically significant discrimination gains, while using 4-18x fewer parameters than the strongest probing baseline.
📝 Abstract
Large vision-language models suffer from visual ungroundedness: they can produce a fluent, confident, and even correct response driven entirely by language priors, with the image contributing nothing to the prediction. Existing confidence estimation methods cannot detect this, as they observe model behavior under normal inference with no mechanism to determine whether a prediction was shaped by the image or by text alone. We introduce BICR (Blind-Image Contrastive Ranking), a model-agnostic confidence estimation framework that makes this contrast explicit during training by extracting hidden states from a frozen LVLM twice: once with the real image-question pair, and once with the image blacked out while the question is held fixed. A lightweight probe is trained on the real-image hidden state and regularized by a ranking loss that penalizes higher confidence on the blacked-out view, teaching it to treat visual grounding as a signal of reliability at zero additional inference cost. Evaluated across five modern LVLMs and seven baselines on a benchmark covering visual question answering, object hallucination detection, medical imaging, and financial document understanding, BICR achieves the best cross-LVLM average on both calibration and discrimination simultaneously, with statistically significant discrimination gains robust to cluster-aware analysis at 4-18x fewer parameters than the strongest probing baseline.
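To make the training objective concrete, the following is a minimal sketch of a probe loss with a blind-image ranking term. It is not the authors' implementation: the abstract does not give the exact loss, so the linear probe, the binary cross-entropy term, the hinge form of the ranking penalty, and the names `probe_confidence`, `bicr_style_loss`, `margin`, and `lam` are all assumptions made for illustration. The hidden states are taken as plain lists of floats that would, in practice, come from the frozen LVLM.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def probe_confidence(weights, bias, hidden):
    """Hypothetical linear probe: hidden-state vector -> confidence in (0, 1)."""
    z = sum(w * h for w, h in zip(weights, hidden)) + bias
    return sigmoid(z)


def bicr_style_loss(weights, bias, h_real, h_blind, correct, margin=0.0, lam=1.0):
    """Assumed objective: BCE on the real-image view plus a hinge ranking term.

    h_real  : hidden state from the real image-question pair
    h_blind : hidden state with the image blacked out, same question
    correct : 1 if the LVLM's answer was correct, else 0
    margin, lam : illustrative hyperparameters, not taken from the source
    """
    c_real = probe_confidence(weights, bias, h_real)
    c_blind = probe_confidence(weights, bias, h_blind)

    # Supervised term: the probe is trained on the real-image hidden state.
    eps = 1e-12
    bce = -(correct * math.log(c_real + eps)
            + (1 - correct) * math.log(1.0 - c_real + eps))

    # Ranking term: positive only when the blacked-out view is scored
    # at least as confidently as the real view (up to the margin).
    rank = max(0.0, margin - (c_real - c_blind))

    return bce + lam * rank, c_real, c_blind


# Example: the real-image state scores higher, so the ranking term is zero.
loss, c_real, c_blind = bicr_style_loss(
    weights=[1.0, -1.0], bias=0.0,
    h_real=[2.0, 0.0], h_blind=[0.0, 0.0], correct=1,
)
```

Under these assumptions the blacked-out forward pass is needed only to compute the training loss. At inference, a single call to `probe_confidence` on the real-image hidden state yields the confidence, which matches the abstract's claim of zero additional inference cost.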