Leveraging Visual Signals for Robust Token-Level Uncertainty in Vision-Language Generation

📅 2026-05-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing uncertainty quantification methods for large vision-language models (LVLMs) are largely adapted from methods designed for pure language models and therefore neglect the role of visual signals in shaping prediction confidence. This work investigates how visual information influences LVLM uncertainty and finds that high-confidence predictions rely more heavily on visual content than uncertain ones. Building on this insight, we propose VIG-TUQ, a training-free, token-level uncertainty quantification method that weights token-level language uncertainty with visual grounding scores. Specifically, VIG-TUQ analyzes hidden representations after visual features have been integrated and derives a visual grounding score that modulates token-level confidence estimates. The approach is evaluated on early-fusion, late-fusion, and native-fusion LVLMs. Experiments on multiple datasets and LVLM architectures indicate that VIG-TUQ often improves upon existing token-level uncertainty estimation methods.

📝 Abstract

Uncertainty quantification (UQ) remains a critical challenge in Large Vision Language Models (LVLMs) for reliable predictions and real-world deployment. However, most existing methods are adapted from the LLM literature and primarily focus on the language modality, leaving the contribution of visual information to LVLM uncertainty largely underexplored. In this paper, we investigate how LVLMs process visual information and whether this process can be used to improve uncertainty estimation. By analyzing hidden representations after the integration of visual features during the generation process, we observe that high-confidence predictions rely more heavily on visual content than uncertain ones. Building on this insight, we propose Visual-Grounded Token UQ (VIG-TUQ), a training-free framework that explicitly incorporates visual grounding into uncertainty estimation by weighting token-level language uncertainty with visual grounding scores. We evaluate VIG-TUQ on multiple datasets and across diverse LVLM architectures, including early-fusion, late-fusion, and native-fusion models. Results indicate that our method often improves upon existing token-level uncertainty approaches. Code and data will be made available upon acceptance.
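
The abstract does not give the weighting formula, so the following is only a minimal sketch of the general idea of weighting token-level language uncertainty with visual grounding scores. It makes several assumptions that are not taken from the paper: language uncertainty is measured as the Shannon entropy of each next-token distribution, each grounding score is already available as a number in [0, 1], and the weight is the hypothetical factor (1 - score), so that strongly grounded tokens count as less uncertain. The function names `token_entropy` and `reweighted_uncertainty` are illustrative, and the sketch does not show how VIG-TUQ derives grounding scores from hidden representations.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def reweighted_uncertainty(token_probs, grounding_scores):
    """Scale each token's entropy by (1 - grounding score).

    ASSUMPTION: the (1 - g) weight is a hypothetical choice for illustration,
    not the formula used by VIG-TUQ. Scores are assumed to lie in [0, 1],
    with 1 meaning the token is fully grounded in the image.
    """
    if len(token_probs) != len(grounding_scores):
        raise ValueError("need one grounding score per generated token")
    weighted = []
    for probs, g in zip(token_probs, grounding_scores):
        if not 0.0 <= g <= 1.0:
            raise ValueError("grounding scores must lie in [0, 1]")
        weighted.append((1.0 - g) * token_entropy(probs))
    return weighted


# Toy usage: two generated tokens with the same language entropy,
# one strongly grounded (0.9) and one weakly grounded (0.1).
toy_probs = [[0.5, 0.5], [0.5, 0.5]]
toy_scores = [0.9, 0.1]
per_token = reweighted_uncertainty(toy_probs, toy_scores)
sequence_score = sum(per_token) / len(per_token)  # simple mean aggregation
```

In this toy case both tokens have identical language entropy, so the strongly grounded token ends up with the lower weighted uncertainty. Any monotonically decreasing function of the grounding score could be swapped in for (1 - g) without changing the structure of the sketch.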

Problem

Research questions and friction points this paper is trying to address.

Uncertainty Quantification

Vision-Language Models

Visual Signals

Token-Level Uncertainty

Large Vision Language Models

Innovation

Methods, ideas, or system contributions that make the work stand out.

uncertainty quantification

vision-language models

visual grounding