Intra-Class Probabilistic Embeddings for Uncertainty Estimation in Vision-Language Models

📅 2025-11-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision-language models (e.g., CLIP) often assign high confidence to incorrect predictions in open-vocabulary classification, compromising reliability in safety-critical applications. To address this, we propose a training-free, post-hoc uncertainty estimation method. Our approach introduces class-specific probabilistic embeddings: leveraging image encoder features, it constructs multivariate Gaussian distributions in the projection space to model intra-class visual consistency. These embeddings enable plug-and-play confidence calibration—robust to distributional shift—and require only ~10 samples per class for effective operation. Evaluated on benchmarks including ImageNet and Flowers102, our method substantially outperforms both deterministic and probabilistic baselines, achieving state-of-the-art performance in error detection.

📝 Abstract
Vision-language models (VLMs), such as CLIP, have gained popularity for their strong open-vocabulary classification performance, but they are prone to assigning high confidence scores to misclassifications, limiting their reliability in safety-critical applications. We introduce a training-free, post-hoc uncertainty estimation method for contrastive VLMs that can be used to detect erroneous predictions. The key to our approach is to measure visual feature consistency within a class, using feature projection combined with multivariate Gaussians to create class-specific probabilistic embeddings. Our method is VLM-agnostic, requires no fine-tuning, demonstrates robustness to distribution shift, and works effectively with as few as 10 training images per class. Extensive experiments on ImageNet, Flowers102, Food101, EuroSAT and DTD show state-of-the-art error detection performance, significantly outperforming both deterministic and probabilistic VLM baselines. Code is available at https://github.com/zhenxianglin/ICPE.
Problem

Research questions and friction points this paper is trying to address.

Estimates uncertainty in vision-language models to detect misclassifications
Creates probabilistic embeddings using intra-class visual feature consistency
Provides training-free, robust error detection with minimal data requirements
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free post-hoc uncertainty estimation for VLMs
Class-specific probabilistic embeddings via multivariate Gaussians
Robust error detection with minimal training data
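The core idea, as described in the abstract, can be sketched as follows: fit a multivariate Gaussian per class from a handful of image features, then score a test feature by its (negative) log-density under the predicted class's Gaussian, with high scores flagging likely misclassifications. This is a minimal illustrative sketch, not the authors' exact ICPE implementation; the function names, the shrinkage regularizer (needed because ~10 samples per class leaves the sample covariance singular), and the omission of the paper's feature-projection step are all assumptions.

```python
import numpy as np

def fit_class_gaussians(feats_by_class, shrink=0.1):
    """Fit a regularized multivariate Gaussian per class.

    feats_by_class: dict mapping class id -> (n, d) array of image
    features (n can be as small as ~10). The covariance is shrunk
    toward a scaled identity so it stays invertible with few samples.
    Returns per-class (mean, precision, log-determinant) tuples.
    """
    gaussians = {}
    for c, X in feats_by_class.items():
        d = X.shape[1]
        mu = X.mean(axis=0)
        cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(d)
        # Shrinkage: blend sample covariance with a scaled identity.
        cov = (1 - shrink) * cov + shrink * (np.trace(cov) / d) * np.eye(d)
        gaussians[c] = (mu, np.linalg.inv(cov), np.linalg.slogdet(cov)[1])
    return gaussians

def uncertainty_score(feat, gaussian):
    """Negative Gaussian log-density (up to an additive constant).

    Higher values mean the test feature is less consistent with the
    class's intra-class distribution, i.e. a likelier error.
    """
    mu, precision, logdet = gaussian
    diff = feat - mu
    return 0.5 * (diff @ precision @ diff + logdet)
```

A feature close to a class's training features scores low under that class and high under others, which is what makes the score usable for post-hoc error detection without any fine-tuning of the VLM.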
Zhenxiang Lin
Queensland University of Technology, Brisbane, Australia
Maryam Haghighat
Queensland University of Technology, Brisbane, Australia
Will Browne
Queensland University of Technology, Brisbane, Australia
Dimity Miller
Queensland University of Technology
Uncertainty Estimation · Robotic Vision · Open-set Recognition · Object Detection