Grounded or Guessing? LVLM Confidence Estimation via Blind-Image Contrastive Ranking

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses hallucinated, visually ungrounded responses in large vision-language models (LVLMs), which often arise from overreliance on linguistic priors and go undetected by existing confidence estimation methods. The authors propose BICR (Blind-Image Contrastive Ranking), a framework that makes the influence of the visual input on a prediction explicit during training. BICR trains a lightweight probe on hidden states of a frozen LVLM and regularizes it with a ranking loss that contrasts the hidden state for the original image with the hidden state for a blacked-out image, so that visual groundedness serves as a confidence signal. The approach is model-agnostic and adds no inference overhead. In experiments on five LVLMs against seven baselines, BICR achieves the best cross-LVLM average on both confidence calibration and discrimination, with statistically significant discrimination gains, while using 4 to 18 times fewer parameters than the strongest probe-based baseline.

📝 Abstract

Large vision-language models suffer from visual ungroundedness: they can produce a fluent, confident, and even correct response driven entirely by language priors, with the image contributing nothing to the prediction. Existing confidence estimation methods cannot detect this, as they observe model behavior under normal inference with no mechanism to determine whether a prediction was shaped by the image or by text alone. We introduce BICR (Blind-Image Contrastive Ranking), a model-agnostic confidence estimation framework that makes this contrast explicit during training by extracting hidden states from a frozen LVLM twice: once with the real image-question pair, and once with the image blacked out while the question is held fixed. A lightweight probe is trained on the real-image hidden state and regularized by a ranking loss that penalizes higher confidence on the blacked-out view, teaching it to treat visual grounding as a signal of reliability at zero additional inference cost. Evaluated across five modern LVLMs and seven baselines on a benchmark covering visual question answering, object hallucination detection, medical imaging, and financial document understanding, BICR achieves the best cross-LVLM average on both calibration and discrimination simultaneously, with statistically significant discrimination gains robust to cluster-aware analysis at 4-18x fewer parameters than the strongest probing baseline.
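The training objective described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the linear probe, the binary cross-entropy term, the hinge form of the ranking penalty, the `margin` and `lam` values, and every function name are assumptions made for illustration. The abstract states only that a lightweight probe is trained on the real-image hidden state and regularized by a ranking loss that penalizes higher confidence on the blacked-out view.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def probe_confidence(hidden, weights, bias):
    """Hypothetical linear probe: maps a hidden-state vector to a confidence in (0, 1)."""
    z = sum(w * h for w, h in zip(weights, hidden)) + bias
    return sigmoid(z)


def bce(conf, label):
    """Binary cross-entropy between predicted confidence and a 0/1 correctness label."""
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(conf + eps) + (1 - label) * math.log(1 - conf + eps))


def ranking_penalty(conf_real, conf_blind, margin=0.1):
    """Assumed hinge form: zero once the real-image confidence exceeds the
    blind-image confidence by at least `margin`, positive otherwise."""
    return max(0.0, margin + conf_blind - conf_real)


def bicr_style_loss(h_real, h_blind, label, weights, bias, lam=1.0, margin=0.1):
    """Supervised loss on the real-image view plus a weighted ranking penalty.

    h_real  : hidden state extracted with the real image and the question
    h_blind : hidden state extracted with the image blacked out, same question
    label   : 1 if the LVLM's answer was correct, else 0
    lam     : assumed weight on the ranking term
    """
    c_real = probe_confidence(h_real, weights, bias)
    c_blind = probe_confidence(h_blind, weights, bias)
    return bce(c_real, label) + lam * ranking_penalty(c_real, c_blind, margin)


# Toy hidden states (made-up numbers, not taken from any model).
weights, bias = [1.0, -1.0], 0.0
h_real, h_blind = [2.0, 0.0], [0.0, 0.0]
loss = bicr_style_loss(h_real, h_blind, 1, weights, bias)
```

In this sketch the blind-image hidden state appears only inside the training loss. At inference the probe would score the real-image hidden state alone, which is consistent with the zero additional inference cost claimed in the abstract.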

Problem

Research questions and friction points this paper is trying to address.

visual ungroundedness

confidence estimation

vision-language models

image grounding

hallucination detection

Innovation

Methods, ideas, or system contributions that make the work stand out.

confidence estimation

visual grounding

blind-image contrastive ranking