🤖 AI Summary
This work characterizes the geometric mechanisms of the InfoNCE loss in contrastive learning, moving beyond the conventional alignment–uniformity decomposition. By modeling contrastive learning as the evolution of a representation measure on a fixed embedding manifold and combining tools from measure theory, large-batch asymptotics, and energy landscape analysis, the authors establish a unified geometric framework. They show that in the unimodal setting the energy landscape is strictly convex and admits a unique Gibbs equilibrium, whereas in the multimodal setting a persistent negative symmetric divergence term induces a structural, population-level modality gap, leading to distributional misalignment. Uniformity is reinterpreted as constrained entropy expansion within the alignment basin, offering a theoretical foundation for diagnosing and controlling multimodal contrastive learning dynamics.
📝 Abstract
While InfoNCE powers modern contrastive learning, its geometric mechanisms remain under-characterized beyond the canonical alignment–uniformity decomposition. We present a measure-theoretic framework that models learning as the evolution of representation measures on a fixed embedding manifold. By establishing value and gradient consistency in the large-batch limit, we bridge the stochastic objective to explicit deterministic energy landscapes, uncovering a fundamental geometric bifurcation between the unimodal and multimodal regimes. In the unimodal setting, the intrinsic landscape is strictly convex with a unique Gibbs equilibrium; here, entropy acts merely as a tie-breaker, clarifying "uniformity" as a constrained expansion within the alignment basin. In contrast, the symmetric multimodal objective contains a persistent negative symmetric divergence term that remains even after kernel sharpening. We show that this term induces barrier-driven co-adaptation, enforcing a population-level modality gap as a structural geometric necessity rather than an initialization artifact. Our results shift the analytical lens from pointwise discrimination to population geometry, offering a principled basis for diagnosing and controlling distributional misalignment.
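For readers who want a concrete handle on the objects the abstract refers to, the following is a minimal sketch of the standard finite-batch symmetric InfoNCE loss between two sets of paired embeddings, together with a simple population-level gap diagnostic. It is not taken from the paper: the function names, the temperature value, and the use of the distance between normalized centroids as a proxy for the modality gap are all illustrative assumptions.

```python
import math

def normalize(v):
    """Project a vector onto the unit sphere."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def log_softmax_at(scores, index):
    """Numerically stable log-softmax of `scores`, evaluated at `index`."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[index] - log_z

def symmetric_info_nce(xs, ys, temperature=0.1):
    """Standard symmetric InfoNCE over paired embeddings (xs[i], ys[i]).

    `temperature` is an assumed value, not one taken from the paper.
    """
    xs = [normalize(x) for x in xs]
    ys = [normalize(y) for y in ys]
    n = len(xs)
    # Cosine similarities scaled by the temperature.
    sim = [[sum(a * b for a, b in zip(x, y)) / temperature for y in ys] for x in xs]
    # x -> y direction: each row is a classification over candidate y's.
    loss_xy = -sum(log_softmax_at(sim[i], i) for i in range(n)) / n
    # y -> x direction: each column is a classification over candidate x's.
    loss_yx = -sum(
        log_softmax_at([sim[j][i] for j in range(n)], i) for i in range(n)
    ) / n
    return 0.5 * (loss_xy + loss_yx)

def centroid_gap(xs, ys):
    """Assumed diagnostic: Euclidean distance between normalized centroids."""
    xs = [normalize(x) for x in xs]
    ys = [normalize(y) for y in ys]
    dim = len(xs[0])
    cx = [sum(x[d] for x in xs) / len(xs) for d in range(dim)]
    cy = [sum(y[d] for y in ys) / len(ys) for d in range(dim)]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(cx, cy)))
```

With correctly paired embeddings the loss is small, with mismatched pairs it is large, and with fully collapsed embeddings it equals the log of the batch size. The centroid distance is a pointwise-agnostic summary of how far apart the two embedding populations sit, which is the kind of population-level quantity the abstract argues should be monitored.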