🤖 AI Summary
This work addresses two key limitations in multimodal contrastive learning: insufficient cross-modal representation alignment and narrow task support. We propose a unified framework based on conditional probability modeling, formalizing image–text contrastive learning as the joint optimization of parametric encoders for the conditional distributions $p(z_v \mid z_t)$ and $p(z_t \mid z_v)$. We introduce a probabilistic contrastive loss and a latent-space alignment metric; under a multivariate Gaussian assumption, alignment learning is equivalently reformulated as low-rank matrix approximation, endowing the method with statistical interpretability. Extensive evaluation on MNIST, synthetic Gaussian data, and an ocean data assimilation task demonstrates effectiveness across cross-modal retrieval, classification, and generation—consistently outperforming strong baselines. Notably, our approach significantly enhances pattern discovery and controllable generation under few-shot settings.
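To make the conditional-distribution reading concrete, here is a minimal NumPy sketch of the standard symmetric InfoNCE objective, in which the row-wise softmax of the batch similarity matrix serves as an estimate of $p(z_t \mid z_v)$ and the column-wise softmax as an estimate of $p(z_v \mid z_t)$. The function names (`info_nce`, `_logsumexp`) and the temperature default are our own illustration, not the paper's implementation:

```python
import numpy as np

def _logsumexp(a, axis):
    # Numerically stable log-sum-exp along the given axis
    m = a.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))

def info_nce(Z_v, Z_t, tau=0.1):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    The row-wise softmax of the similarity matrix is a batch estimate
    of p(z_t | z_v); the column-wise softmax estimates p(z_v | z_t).
    """
    # Normalize embeddings to the unit sphere
    Z_v = Z_v / np.linalg.norm(Z_v, axis=1, keepdims=True)
    Z_t = Z_t / np.linalg.norm(Z_t, axis=1, keepdims=True)
    logits = Z_v @ Z_t.T / tau                       # pairwise similarities
    log_p_t_given_v = logits - _logsumexp(logits, axis=1)
    log_p_v_given_t = logits - _logsumexp(logits, axis=0)
    idx = np.arange(len(Z_v))
    # Maximize log-probability of the matched (diagonal) pairs in both directions
    return -0.5 * (log_p_t_given_v[idx, idx] + log_p_v_given_t[idx, idx]).mean()
```

Matched pairs sit on the diagonal of the similarity matrix, so the loss is low when paired embeddings are aligned and high when the pairing is scrambled.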
📝 Abstract
Multimodal contrastive learning is a methodology for linking different data modalities; the canonical example is linking image and text data. The methodology is typically framed as the identification of a set of encoders, one for each modality, that align representations within a common latent space. In this work, we focus on the bimodal setting and interpret contrastive learning as the optimization of (parameterized) encoders that define conditional probability distributions, for each modality conditioned on the other, consistent with the available data. This provides a framework for multimodal algorithms such as crossmodal retrieval, which identifies the mode of one of these conditional distributions, and crossmodal classification, which is similar to retrieval but includes a fine-tuning step to make it task-specific. The framework we adopt also gives rise to crossmodal generative models. This probabilistic perspective suggests two natural generalizations of contrastive learning: the introduction of novel probabilistic loss functions, and the use of alternative metrics for measuring alignment in the common latent space. We study these generalizations of the classical approach in the multivariate Gaussian setting. In this context we view the latent space identification as a low-rank matrix approximation problem. This allows us to characterize the capabilities of loss functions and alignment metrics to approximate natural statistics, such as conditional means and covariances; doing so yields novel variants on contrastive learning algorithms for specific mode-seeking and for generative tasks. The framework we introduce is also studied through numerical experiments on multivariate Gaussians, the labeled MNIST dataset, and on a data assimilation application arising in oceanography.
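The reduction to low-rank matrix approximation can be illustrated directly in the jointly Gaussian case: the conditional mean $\mathbb{E}[x_t \mid x_v] = C_{tv} C_{vv}^{-1} x_v$ is a linear map, and restricting it to rank $r$ amounts to a truncated SVD of the whitened cross-covariance, the classical CCA-style reduction. The sketch below is our own illustration of that standard construction under these assumptions (the function name and the small diagonal regularizer are not from the paper):

```python
import numpy as np

def gaussian_crossmodal_map(X_v, X_t, rank):
    """Rank-r approximation of the conditional-mean map x_v -> E[x_t | x_v]
    for jointly Gaussian modalities, via truncated SVD of the whitened
    cross-covariance (a CCA-style reduction). Rows of X_v, X_t are paired
    samples; returns the matrix A with E[x_t | x_v] ~ A @ x_v (centered)."""
    X_v = X_v - X_v.mean(0)
    X_t = X_t - X_t.mean(0)
    n = len(X_v)
    # Empirical covariances, with a tiny ridge for numerical stability
    C_vv = X_v.T @ X_v / n + 1e-8 * np.eye(X_v.shape[1])
    C_tt = X_t.T @ X_t / n + 1e-8 * np.eye(X_t.shape[1])
    C_tv = X_t.T @ X_v / n
    # Whitening transforms: W C W^T = I for W = L^{-1}, C = L L^T
    L_v, L_t = np.linalg.cholesky(C_vv), np.linalg.cholesky(C_tt)
    W_v, W_t = np.linalg.inv(L_v), np.linalg.inv(L_t)
    M = W_t @ C_tv @ W_v.T                 # whitened cross-covariance
    U, s, Vt = np.linalg.svd(M)
    # Keep the top-r singular directions, then undo the whitening;
    # at full rank this recovers C_tv @ inv(C_vv) exactly.
    return L_t @ (U[:, :rank] * s[:rank]) @ Vt[:rank] @ W_v
```

At full rank the returned matrix coincides with the exact Gaussian conditional-mean map; smaller ranks trade prediction accuracy for a lower-dimensional shared latent space, which is the trade-off the Gaussian analysis in the paper quantifies.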