🤖 AI Summary
Existing vision-language models (VLMs) struggle to quantify epistemic uncertainty and to recognize their own knowledge gaps. This work applies Riemannian flow matching on the hyperspherical manifold of VLM embeddings for the first time. It estimates the probability density of an embedding and uses the negative log-density as a proxy for epistemic uncertainty, which ties uncertainty theoretically to the embedding distribution. The proposed method, REPVLM, achieves near-perfect correlation between uncertainty estimates and prediction errors, significantly outperforming current baselines. Beyond classification, it also serves as a scalable metric for out-of-distribution detection and automated data curation.
📝 Abstract
Vision-Language Models (VLMs) are typically deterministic and lack intrinsic mechanisms to quantify epistemic uncertainty, which reflects the model's lack of knowledge about its own representations. We theoretically motivate the negative log-density of an embedding as a proxy for epistemic uncertainty, where low-density regions signify model ignorance. The proposed method, REPVLM, computes the probability density on the hyperspherical manifold of the VLM embeddings using Riemannian Flow Matching. We empirically demonstrate that REPVLM achieves near-perfect correlation between uncertainty and prediction error, significantly outperforming existing baselines. Beyond classification, we also demonstrate that REPVLM provides a scalable metric for out-of-distribution detection and automated data curation.
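
The idea of scoring an embedding by its negative log-density on the unit hypersphere can be illustrated with a small sketch. This is not REPVLM: the density estimator below is a simple von Mises-Fisher kernel density estimate swapped in for Riemannian Flow Matching, and the embeddings are synthetic. The function names, the concentration `kappa`, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def normalize(x):
    """Project rows onto the unit hypersphere."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def neg_log_density(query, reference, kappa=20.0):
    """Negative log of a vMF kernel density estimate, up to an additive constant.

    Stand-in estimator (assumption): the source uses Riemannian Flow Matching.
    The vMF normalizer depends only on kappa and the dimension, so dropping it
    shifts every score by the same constant and leaves the ranking unchanged.
    """
    sims = normalize(query) @ normalize(reference).T   # cosine similarities, (Q, N)
    logits = kappa * sims
    m = logits.max(axis=1, keepdims=True)              # for a stable log-mean-exp
    log_density = m[:, 0] + np.log(np.exp(logits - m).mean(axis=1))
    return -log_density                                # higher = more uncertain


# Synthetic "embeddings" (assumption): training data clustered around one pole.
rng = np.random.default_rng(0)
dim = 16
center = np.zeros(dim)
center[0] = 1.0

reference = normalize(center + 0.1 * rng.standard_normal((500, dim)))
in_dist = normalize(center + 0.1 * rng.standard_normal((5, dim)))
out_dist = normalize(-center + 0.1 * rng.standard_normal((5, dim)))  # opposite pole

u_in = neg_log_density(in_dist, reference)
u_out = neg_log_density(out_dist, reference)
print("in-distribution uncertainty:", u_in.round(2))
print("out-of-distribution uncertainty:", u_out.round(2))
```

In this toy setting, queries near the reference cluster lie in a high-density region and receive low scores, while queries at the opposite pole lie in a low-density region and receive high scores. Thresholding such a score is one plausible way to flag out-of-distribution inputs or to filter samples during data curation.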