🤖 AI Summary
This work investigates the representation-theoretic properties of multimodal contrastive learning under nonlinear, non-Gaussian data distributions, focusing on how such methods can go beyond the user-specified representation dimension to adaptively discover the data's intrinsic low-dimensional structure. We propose a temperature-optimized multimodal contrastive learning framework and theoretically establish that, under mild assumptions, it simultaneously maximizes inter-modal mutual information and automatically identifies and compresses representations to the true intrinsic dimension of the shared latent variable. This constitutes the first theoretical characterization linking the learned representation dimension in contrastive learning to the underlying data manifold dimension. Experiments on synthetic benchmarks and real-world multimodal datasets (e.g., CC3M, Kinetics) demonstrate that the resulting representations are both low-dimensional and highly informative, effectively bridging the gap between theoretical analysis and empirical performance.
📝 Abstract
Multi-modal contrastive learning, a self-supervised representation learning technique, has achieved great success in foundation model training, such as CLIP (Radford et al., 2021). In this paper, we study the theoretical properties of the representations learned by multi-modal contrastive learning, going beyond linear representations and specific data distributions. Our analysis reveals that, enabled by temperature optimization, multi-modal contrastive learning not only maximizes mutual information between modalities but also adapts to the intrinsic dimension of the data, which can be much lower than the user-specified dimension of the representation vectors. Experiments on both synthetic and real-world datasets demonstrate the ability of contrastive learning to learn low-dimensional and informative representations, bridging theoretical insights and practical performance.
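The objective discussed above, a CLIP-style symmetric contrastive loss with a learnable temperature, can be sketched as follows. This is a minimal NumPy illustration under the standard InfoNCE formulation; the function names, batch shapes, and the `log_temp` parameterization are assumptions for exposition, not the paper's exact implementation.

```python
import numpy as np

def clip_style_loss(img_emb, txt_emb, log_temp):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    img_emb, txt_emb: (batch, dim) arrays; row i of each is a positive pair.
    log_temp: scalar log inverse-temperature, a learnable parameter in
    CLIP-style training (temperature optimization).
    """
    # L2-normalize so the inner product is cosine similarity
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    # Pairwise similarity logits, scaled by the learnable temperature
    logits = (img @ txt.T) * np.exp(log_temp)

    def cross_entropy(l):
        # Positives sit on the diagonal; subtract row max for stability
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

In practice the temperature is optimized jointly with the encoders by gradient descent, which is the mechanism the abstract credits with adapting the effective representation dimension.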