VL-Calibration: Decoupled Confidence Calibration for Large Vision-Language Models Reasoning

📅 2026-04-10

🤖 AI Summary
This work addresses the problem that large vision-language models often produce hallucinated or incorrect answers with high confidence, while existing calibration methods cannot distinguish perceptual errors from reasoning errors. The authors propose a reinforcement learning framework that explicitly decouples visual confidence from reasoning confidence. Visual certainty is estimated without ground-truth perception labels by combining KL divergence under image perturbations with token entropy, and it drives a token-level advantage reweighting strategy during optimization. The method improves both calibration and visual reasoning accuracy across 13 benchmarks and generalizes to out-of-distribution benchmarks across model scales and architectures.
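To make the visual certainty idea concrete, here is a minimal sketch of how a per-token score could combine the two signals named in the summary. The summary does not give the actual formula, so the function names, the `1 - exp(-KL)` squashing, the entropy normalization, and the product used to combine them are all illustrative assumptions rather than the authors' method.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two next-token distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def entropy(p, eps=1e-12):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(pi * math.log(pi + eps) for pi in p if pi > 0)

def visual_certainty(p_clean, p_perturbed):
    """Hypothetical per-token score in [0, 1).

    p_clean:     token distribution given the original image
    p_perturbed: token distribution given a perturbed image
    The combination rule below is an assumption made for illustration only.
    """
    # Visual grounding: a token that depends on the image should change
    # its distribution when the image is perturbed (large KL).
    grounding = 1.0 - math.exp(-kl_divergence(p_clean, p_perturbed))
    # Internal certainty: low entropy on the clean image means the model is sure.
    certainty = 1.0 - entropy(p_clean) / math.log(len(p_clean))
    return grounding * certainty

# A token whose distribution shifts under perturbation vs. one driven by language priors.
grounded = visual_certainty([0.9, 0.05, 0.05], [0.2, 0.4, 0.4])
ungrounded = visual_certainty([0.9, 0.05, 0.05], [0.9, 0.05, 0.05])
```

In this toy setup, the token whose distribution is unchanged by the perturbation scores zero, which matches the intuition that it is produced from language priors rather than from the image.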

📝 Abstract
Large Vision Language Models (LVLMs) achieve strong multimodal reasoning but frequently exhibit hallucinations and incorrect responses with high certainty, which hinders their usage in high-stakes domains. Existing verbalized confidence calibration methods, largely developed for text-only LLMs, typically optimize a single holistic confidence score using binary answer-level correctness. This design is mismatched to LVLMs: an incorrect prediction may arise from perceptual failures or from reasoning errors given correct perception, and a single confidence conflates these sources while visual uncertainty is often dominated by language priors. To address these issues, we propose VL-Calibration, a reinforcement learning framework that explicitly decouples confidence into visual and reasoning confidence. To supervise visual confidence without ground-truth perception labels, we introduce an intrinsic visual certainty estimation that combines (i) visual grounding measured by KL-divergence under image perturbations and (ii) internal certainty measured by token entropy. We further propose token-level advantage reweighting to focus optimization on tokens based on visual certainty, suppressing ungrounded hallucinations while preserving valid perception. Experiments on thirteen benchmarks show that VL-Calibration effectively improves calibration while boosting visual reasoning accuracy, and it generalizes to out-of-distribution benchmarks across model scales and architectures.
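The token-level advantage reweighting described in the abstract can be sketched as follows. The abstract does not specify the weighting rule, so this is only one plausible reading: the function name `reweight_advantages`, the `floor` parameter, and the choice to weight by certainty for positive advantages and by its complement for negative ones are hypothetical.

```python
def reweight_advantages(advantage, token_certainty, floor=0.1):
    """Spread one sequence-level advantage over tokens using visual certainty.

    advantage:       scalar advantage for the whole sampled response
    token_certainty: per-token visual certainty scores in [0, 1]
    floor:           hypothetical minimum weight so no token is ignored entirely
    This weighting rule is an illustrative assumption, not the paper's formula.
    """
    weights = []
    for c in token_certainty:
        # Positive advantage: reinforce well-grounded tokens more strongly.
        # Negative advantage: push the penalty onto poorly grounded tokens.
        w = c if advantage >= 0 else 1.0 - c
        weights.append(max(floor, w))
    # Normalize so the mean weight is 1 and the overall update scale is unchanged.
    mean_w = sum(weights) / len(weights)
    return [advantage * w / mean_w for w in weights]

# One grounded token (0.9) and one ungrounded token (0.1) in a penalized response.
penalized = reweight_advantages(-1.0, [0.9, 0.1])
rewarded = reweight_advantages(1.0, [0.9, 0.1])
```

Under these assumptions, a penalized response concentrates its negative advantage on the ungrounded token, while a rewarded response concentrates its positive advantage on the grounded one, in the spirit of suppressing ungrounded hallucinations while preserving valid perception.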
Problem

Research questions and friction points this paper is trying to address.

confidence calibration
vision-language models
hallucination
visual uncertainty
reasoning errors
Innovation

Methods, ideas, or system contributions that make the work stand out.

confidence calibration
vision-language models
decoupled confidence
visual grounding
reinforcement learning