ViLU: Learning Vision-Language Uncertainties for Failure Prediction

📅 2025-07-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses unreliable uncertainty quantification (UQ) and failure prediction in vision-language models (VLMs). The authors propose ViLU, an embedding-level, loss-agnostic binary classifier for uncertainty estimation. Methodologically: (1) multi-source textual representations, conditioned on the image, are fused with the visual embedding via cross-attention, yielding a context-aware multimodal representation; (2) a lightweight, embedding-only prediction head is trained with a weighted binary cross-entropy loss to distinguish correct from incorrect predictions, eliminating dependence on the original task's loss function. Experiments on ImageNet-1k, CC12M, and LAION-400M show substantial improvements in failure prediction accuracy and uncertainty calibration over state-of-the-art UQ baselines. The approach is modular, requires no task-specific supervision, and works post-hoc from vision and text embeddings alone.
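A minimal sketch of such an uncertainty head, assuming CLIP-style embeddings and PyTorch. The module name, dimensions, and fusion-by-concatenation are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class UncertaintyHead(nn.Module):
    """Illustrative ViLU-style failure predictor (names/dims are assumptions)."""

    def __init__(self, dim=512, n_heads=8):
        super().__init__()
        # Cross-attention: the visual embedding attends over all
        # task-relevant textual embeddings (e.g. class prompts).
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # Lightweight MLP over the fused multimodal representation.
        self.mlp = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, img_emb, txt_embs, pred_txt_emb):
        # img_emb: (B, D)  txt_embs: (B, K, D)  pred_txt_emb: (B, D)
        ctx, _ = self.attn(img_emb.unsqueeze(1), txt_embs, txt_embs)  # (B, 1, D)
        fused = torch.cat([img_emb, pred_txt_emb, ctx.squeeze(1)], dim=-1)
        return self.mlp(fused).squeeze(-1)  # logit: higher = predicted failure
```

Because the head consumes only embeddings, it never needs gradients from, or access to, the underlying VLM, which is what makes the post-hoc setting possible.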

📝 Abstract
Reliable Uncertainty Quantification (UQ) and failure prediction remain open challenges for Vision-Language Models (VLMs). We introduce ViLU, a new Vision-Language Uncertainty quantification framework that contextualizes uncertainty estimates by leveraging all task-relevant textual representations. ViLU constructs an uncertainty-aware multi-modal representation by integrating the visual embedding, the predicted textual embedding, and an image-conditioned textual representation via cross-attention. Unlike traditional UQ methods based on loss prediction, ViLU trains an uncertainty predictor as a binary classifier to distinguish correct from incorrect predictions using a weighted binary cross-entropy loss, making it loss-agnostic. In particular, our proposed approach is well-suited for post-hoc settings, where only vision and text embeddings are available without direct access to the model itself. Extensive experiments on diverse datasets show the significant gains of our method compared to state-of-the-art failure prediction methods. We apply our method to standard classification datasets, such as ImageNet-1k, as well as large-scale image-caption datasets like CC12M and LAION-400M. Ablation studies highlight the critical role of our architecture and training in achieving effective uncertainty quantification. Our code is publicly available and can be found here: https://github.com/ykrmm/ViLU.
Problem

Research questions and friction points this paper is trying to address.

Quantifying uncertainty in Vision-Language Models (VLMs)
Predicting failures in multi-modal vision-language tasks
Post-hoc uncertainty estimation without model access
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages all task-relevant textual representations
Integrates visual and textual embeddings via cross-attention
Trains uncertainty predictor as loss-agnostic binary classifier
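The loss-agnostic objective above can be illustrated with a small stdlib-only sketch of weighted binary cross-entropy over failure labels. The weighting value and function name are assumptions for illustration; the paper's exact weighting scheme may differ:

```python
import math

def weighted_bce(logits, labels, pos_weight):
    """Weighted BCE on raw logits; labels: 1 = prediction was wrong (sketch)."""
    total = 0.0
    for z, y in zip(logits, labels):
        p = 1.0 / (1.0 + math.exp(-z))  # sigmoid
        # Up-weight the rare positive (failure) class by pos_weight.
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(logits)
```

Since correct predictions typically dominate the training set, `pos_weight` would be set above 1 (e.g. the correct/incorrect count ratio) so the classifier does not collapse to always predicting "correct".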