GeoFlowVLM: Geometry-Aware Joint Uncertainty for Frozen Vision-Language Embedding

📅 2026-05-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Standard dual-encoder vision-language models expose neither aleatoric nor epistemic uncertainty, and existing post-hoc methods recover at most one of the two or ignore the hyperspherical geometry of the embedding space. This work proposes GeoFlowVLM, a post-hoc adapter for frozen backbones that uses Riemannian flow matching with a single masked velocity field to learn, on a product hypersphere, both the joint distribution of paired embeddings and the two cross-modal conditional flows. The approach yields an aleatoric entropy with a decision-theoretic interpretation (via a Fano-type bound) and an epistemic score based on marginal typicality. Experiments show that the entropy tracks Recall@1 with near-ideal monotonic calibration across three retrieval benchmarks in both directions, while the epistemic score achieves consistently calibrated selective accuracy on four zero-shot classification benchmarks.
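To make the geometry concrete, the following is a minimal NumPy sketch of the regression target used in Riemannian flow matching on a unit sphere with geodesic interpolation: a point on the geodesic from a source sample to a data sample, and the tangent velocity pointing toward the data sample. It is not the authors' implementation. The function names and the `mask` convention (1 means the factor flows, 0 means it is held fixed at its data value as conditioning) are assumptions made for illustration only; the summary does not specify how the masked velocity field is defined.

```python
import numpy as np

def normalize(x):
    """Project vectors onto the unit hypersphere (l2 normalization)."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def sphere_log(x, y, eps=1e-8):
    """Log map at x: tangent vector at x pointing to y, with length d(x, y)."""
    cos = np.clip(np.sum(x * y, axis=-1, keepdims=True), -1.0, 1.0)
    theta = np.arccos(cos)                      # geodesic distance
    u = y - cos * x                             # part of y orthogonal to x
    norm_u = np.linalg.norm(u, axis=-1, keepdims=True)
    return theta * u / np.maximum(norm_u, eps)

def sphere_exp(x, v, eps=1e-8):
    """Exp map at x: follow the geodesic from x in tangent direction v."""
    n = np.linalg.norm(v, axis=-1, keepdims=True)
    return np.cos(n) * x + np.sin(n) * v / np.maximum(n, eps)

def geodesic_point_and_velocity(x0, x1, t):
    """Geodesic interpolant x_t and target velocity log_{x_t}(x1) / (1 - t)."""
    xt = sphere_exp(x0, t * sphere_log(x0, x1))
    ut = sphere_log(xt, x1) / (1.0 - t)         # requires t < 1
    return xt, ut

def product_sphere_target(img0, txt0, img1, txt1, t, mask=(1, 1)):
    """Per-factor targets on S^{d-1} x S^{d-1}.

    ASSUMPTION (hypothetical convention): mask[i] == 1 lets factor i flow;
    mask[i] == 0 holds it at its data value with zero velocity.
    """
    out = []
    for x0, x1, m in ((img0, img1, mask[0]), (txt0, txt1, mask[1])):
        if m:
            out.append(geodesic_point_and_velocity(x0, x1, t))
        else:
            out.append((x1, np.zeros_like(x1)))
    (img_t, u_img), (txt_t, u_txt) = out
    return (img_t, txt_t), (u_img, u_txt)
```

Under this convention, `mask=(1, 1)` would give a joint target and `mask=(1, 0)` an image-given-text target. A network trained to regress these velocities would take the interpolated pair, the time `t`, and the mask as inputs.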

📝 Abstract

Standard dual-encoder vision-language models that map images and text to deterministic points on a shared unit hypersphere through $\ell_2$ normalization typically expose neither aleatoric uncertainty (cross-modal ambiguity) nor epistemic uncertainty (lack of training-distribution support). Existing post-hoc methods either recover at most one of the two uncertainty components, or ignore the hyperspherical geometry of these models' embeddings. We propose GeoFlowVLM as a post-hoc adapter that learns the joint distribution of paired $\ell_2$-normalized dual-encoder VLM embeddings on the product hypersphere $\mathbb{S}^{d-1} \times \mathbb{S}^{d-1}$ via Riemannian flow matching with a single masked velocity field. A consistency result shows that, in the population limit, the trained network exposes the joint flow and both cross-modal conditional flows as valid Riemannian flow-matching velocity fields on their respective domains. We derive two quantities from this single model: a conditional retrieval entropy that quantifies aleatoric ambiguity with a decision-theoretic interpretation via a Fano-type bound, and a marginal-typicality epistemic score justified by an exact chain-rule decomposition of the joint NLL. This decomposition isolates a cross-modal pointwise-mutual-information term that is structurally discriminative rather than epistemic, and is empirically the only consistently uninformative standalone component. Empirically, the entropy tracks Recall@1 with near-ideal monotonic calibration across three retrieval benchmarks in both directions, and the marginal-typicality sum yields consistently calibrated selective accuracy across four zero-shot classification benchmarks.
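The chain-rule decomposition mentioned in the abstract can be written out explicitly. The notation here is an assumption, not taken from the paper: $x$ denotes the image embedding, $y$ the text embedding, and $p$ the learned joint density with marginals $p(x)$ and $p(y)$. The identity itself is exact for any joint density with positive marginals.

```latex
\begin{aligned}
-\log p(x, y)
  &= -\log p(x) \;-\; \log p(y) \;-\; \mathrm{PMI}(x, y), \\
\mathrm{PMI}(x, y)
  &= \log \frac{p(x, y)}{p(x)\, p(y)}
   = \log p(y \mid x) - \log p(y).
\end{aligned}
```

Read this way, the first two terms would presumably form the marginal-typicality sum used as the epistemic score, while the PMI term measures how well the image and text match each other rather than how well either is supported by the training distribution.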

Problem

Research questions and friction points this paper is trying to address.

aleatoric uncertainty

epistemic uncertainty

vision-language models

hyperspherical geometry

uncertainty quantification

Innovation

Methods, ideas, or system contributions that make the work stand out.

Riemannian flow matching

joint uncertainty quantification

hyperspherical geometry