Exploiting the Asymmetric Uncertainty Structure of Pre-trained VLMs on the Unit Hypersphere

📅 2025-05-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing pre-trained vision-language models (VLMs) lack explicit modeling of modality-inherent uncertainty, particularly overlooking the asymmetric uncertainty structure between text and vision within the unit hypersphere embedding space. To address this, we propose AsymVLM—the first framework to explicitly model cross-modal uncertainty asymmetry while constraining probabilistic embeddings to the unit hypersphere. Our approach employs the von Mises–Fisher distribution for modality-specific probabilistic representation, integrated with hyperspherical manifold optimization and uncertainty decoupling. This overcomes the limitations of conventional posterior adaptation methods, which ignore geometric constraints and modality-specific characteristics. AsymVLM significantly improves uncertainty calibration and downstream robustness across multimodal understanding and retrieval tasks. Ablation studies empirically validate the intrinsic asymmetry between textual and visual uncertainty structures.

Technology Category

Application Category

📝 Abstract
Vision-language models (VLMs) as foundation models have significantly enhanced performance across a wide range of visual and textual tasks, without requiring large-scale training from scratch for downstream tasks. However, these deterministic VLMs fail to capture the inherent ambiguity and uncertainty in natural language and visual data. Recent probabilistic post-hoc adaptation methods address this by mapping deterministic embeddings onto probability distributions; however, existing approaches do not account for the asymmetric uncertainty structure of the modalities, and the constraint that meaningful deterministic embeddings reside on a unit hypersphere, potentially leading to suboptimal performance. In this paper, we address the asymmetric uncertainty structure inherent in textual and visual data, and propose AsymVLM to build probabilistic embeddings from pre-trained VLMs on the unit hypersphere, enabling uncertainty quantification. We validate the effectiveness of the probabilistic embeddings on established benchmarks, and present comprehensive ablation studies demonstrating the inherent nature of asymmetry in the uncertainty structure of textual and visual data.
Problem

Research questions and friction points this paper is trying to address.

Address asymmetric uncertainty in vision-language models
Map deterministic embeddings to unit hypersphere distributions
Enable uncertainty quantification for visual-textual data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Asymmetric uncertainty modeling for VLMs
Probabilistic embeddings on unit hypersphere
Uncertainty quantification in vision-language tasks
🔎 Similar Papers
No similar papers found.
Li Ju
Li Ju
Department of Information Technology, Uppsala University
Federated LearningDistributed OptimizationUncertainty QuantificationMultimodal Language Models
M
Max Andersson
Department of Information Technology, Uppsala University, Uppsala, Sweden
S
Stina Fredriksson
Department of Information Technology, Uppsala University, Uppsala, Sweden
E
Edward Glockner
Department of Information Technology, Uppsala University, Uppsala, Sweden
Andreas Hellander
Andreas Hellander
Associate Professor in Scientific Computing, Division of Scientific Computing, Department of
Computational Systems BiologyScientific ComputingCloud ComputingData Science
Ekta Vats
Ekta Vats
Uppsala University
Computer visionMachine learningImage and Video Analysis
P
Prashant Singh
Science for Life Laboratory, Uppsala University, Uppsala, Sweden