HARMONY: Hidden Activation Representations and Model Output-Aware Uncertainty Estimation for Vision-Language Models

📅 2025-10-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address unreliable uncertainty estimation in vision-language models (VLMs) for high-stakes applications such as autonomous driving and visual assistance, this paper proposes HARMONY, an uncertainty estimation method that jointly models multimodal hidden activations and output probability distributions, with the aim of detecting biased probabilities induced by language priors. The approach uses a lightweight MLP fusion network that combines cross-modal hidden representations with token-level probability distributions to predict how reliable a response is. Evaluated on three open-ended visual question answering benchmarks (A-OKVQA, VizWiz, and PathVQA), the method performs on par with or better than existing approaches, improving AUROC by up to 4% and Prediction Rejection Ratio (PRR) by up to 6%. The authors position this as a step toward trustworthy deployment of VLMs in safety-critical scenarios.

📝 Abstract

The growing deployment of Vision-Language Models (VLMs) in high-stakes applications such as autonomous driving and assistive technologies for visually impaired individuals necessitates reliable mechanisms to assess the trustworthiness of their generation. Uncertainty Estimation (UE) plays a central role in quantifying the reliability of model outputs and reducing unsafe generations via selective prediction. In this regard, most existing probability-based UE approaches rely on output probability distributions, aggregating token probabilities into a single uncertainty score using predefined functions such as length-normalization. Another line of research leverages model hidden representations and trains MLP-based models to predict uncertainty. However, these methods often fail to capture the complex multimodal relationships between semantic and textual tokens and struggle to identify biased probabilities often influenced by language priors. Motivated by these observations, we propose a novel UE framework, HARMONY, that jointly leverages fused multimodal information in model activations and the output distribution of the VLM to determine the reliability of responses. The key hypothesis of our work is that both the model's internal belief in its visual understanding, captured by its hidden representations, and the produced token probabilities carry valuable reliability signals that can be jointly leveraged to improve UE performance, surpassing approaches that rely on only one of these components. Experimental results on three open-ended VQA benchmarks, A-OKVQA, VizWiz, and PathVQA, and three state-of-the-art VLMs, LLaVA-7b, LLaVA-13b, and InstructBLIP, demonstrate that our method consistently performs on par with or better than existing approaches, achieving up to 4% improvement in AUROC and 6% in PRR, establishing a new state of the art in uncertainty estimation for VLMs.
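The probability-based baseline mentioned in the abstract, aggregating token probabilities with length normalization, can be illustrated with a small sketch. The function name `length_normalized_uncertainty` is hypothetical, and the negative mean log-probability used here is one common form of length normalization, not necessarily the exact baseline evaluated in the paper.

```python
import math


def length_normalized_uncertainty(token_probs):
    """Return the negative mean log-probability of the generated tokens.

    Higher values mean the model assigned lower probability, on average,
    to the tokens it produced. Dividing by the length keeps long answers
    from looking more uncertain merely because they contain more tokens.
    """
    if not token_probs:
        raise ValueError("token_probs must be non-empty")
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


# Toy probabilities (made up for illustration, not taken from the paper).
confident = length_normalized_uncertainty([0.9, 0.95, 0.9])
hesitant = length_normalized_uncertainty([0.4, 0.5, 0.3])
```

A score like this depends only on the output distribution, so an answer that is fluent because of language priors but visually wrong can still receive low uncertainty, which is the failure mode the abstract points to.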

Problem

Research questions and friction points this paper is trying to address.

Estimating uncertainty in vision-language model outputs for reliability assessment

Capturing complex multimodal relationships between visual and textual tokens

Improving uncertainty estimation beyond probability-based or representation-based methods
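
How well an uncertainty score supports reliability assessment is commonly measured with AUROC, one of the two metrics reported in the abstract. The sketch below assumes the usual convention that incorrect answers should receive higher uncertainty than correct ones; the function name `auroc` and the toy inputs are illustrative, not taken from the paper.

```python
def auroc(uncertainty, is_correct):
    """Probability that a random incorrect answer has higher uncertainty
    than a random correct answer, counting ties as one half."""
    wrong = [u for u, c in zip(uncertainty, is_correct) if not c]
    right = [u for u, c in zip(uncertainty, is_correct) if c]
    if not wrong or not right:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = sum(
        1.0 if w > r else 0.5 if w == r else 0.0
        for w in wrong
        for r in right
    )
    return wins / (len(wrong) * len(right))


# Perfect separation: every incorrect answer is more uncertain.
perfect = auroc([0.1, 0.2, 0.8, 0.9], [True, True, False, False])
# Uninformative ordering: the score ranks only half of the pairs correctly.
chance = auroc([0.1, 0.9, 0.2, 0.8], [True, True, False, False])
```

An AUROC of 0.5 corresponds to a score that carries no information about correctness, so improvements are read relative to that floor.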

Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages fused multimodal hidden activation representations

Combines model output probabilities with internal beliefs

Jointly uses visual understanding and token probabilities
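
The idea of combining hidden activations with output probabilities can be sketched as a small MLP over a concatenated feature vector. Everything specific here is an assumption made for illustration: the two probability features, the single hidden layer, the toy dimensions, and the random untrained weights. The page does not specify HARMONY's actual architecture or features.

```python
import math
import random


def probability_features(token_probs):
    """Summarize the output distribution as [mean log-prob, min log-prob]."""
    logs = [math.log(p) for p in token_probs]
    return [sum(logs) / len(logs), min(logs)]


def mlp_reliability(hidden, token_probs, w1, b1, w2, b2):
    """One-hidden-layer MLP over [hidden activations ; probability features].

    Returns a value in (0, 1) that, after training, would be read as the
    estimated probability that the response is correct.
    """
    x = list(hidden) + probability_features(token_probs)
    h = [max(0.0, sum(w * v for w, v in zip(row, x)) + b)  # ReLU layer
         for row, b in zip(w1, b1)]
    logit = sum(w * v for w, v in zip(w2, h)) + b2
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid


# Toy setup: random, untrained weights and a made-up hidden vector.
rng = random.Random(0)
hidden_dim, width = 8, 4
w1 = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_dim + 2)]
      for _ in range(width)]
b1 = [0.0] * width
w2 = [rng.uniform(-0.5, 0.5) for _ in range(width)]
b2 = 0.0
hidden = [rng.gauss(0.0, 1.0) for _ in range(hidden_dim)]
score = mlp_reliability(hidden, [0.9, 0.8, 0.95], w1, b1, w2, b2)
```

In practice such a network would be trained on responses labeled correct or incorrect; the point of the sketch is only that both signal sources enter the same predictor instead of being scored separately.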
