🤖 AI Summary
Existing multi-view contrastive learning handles extra views by aggregating pairwise objectives, which creates four key limitations: conflicting optimization terms per data point, incomplete modeling of interactions across views and data points, inherited coupling of alignment and uniformity from pairwise losses, and failure to realize the view-multiplicity benefits seen in supervised learning. This paper addresses these limitations with two novel loss functions: MV-InfoNCE, which extends InfoNCE to capture all view interactions in a single term per data point, and MV-DHEL, which decouples alignment from uniformity across views while scaling interaction complexity with view multiplicity. Both losses are theoretically grounded, with proofs that they asymptotically optimize alignment of all views together with uniformity, and both extend naturally to multimodal data with more than two modalities. Evaluated on ImageNet1K and three other datasets, they consistently outperform existing multi-view approaches and keep improving as the number of views increases. Notably, MV-DHEL with five or more views mitigates dimensionality collapse by fully utilizing the embedding space.
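For context, the alignment and uniformity properties referenced above are the two standard objectives studied in the contrastive-learning literature (e.g., Wang and Isola, 2020). A common formulation is sketched below as background; the paper's multi-view analogues may be defined somewhat differently, so these are not the authors' exact definitions.

$$
\mathcal{L}_{\mathrm{align}}(f)=\mathbb{E}_{(x,\,x^{+})\sim p_{\mathrm{pos}}}\!\left[\lVert f(x)-f(x^{+})\rVert_2^{2}\right],
\qquad
\mathcal{L}_{\mathrm{uniform}}(f)=\log\,\mathbb{E}_{x,\,y\sim p_{\mathrm{data}}}\!\left[e^{-t\,\lVert f(x)-f(y)\rVert_2^{2}}\right].
$$

Pairwise losses such as InfoNCE optimize an entangled combination of these two terms; the decoupling claimed for MV-DHEL means each can be optimized without interfering with the other as the number of views grows.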
📝 Abstract
Contrastive Learning (CL), a leading paradigm in Self-Supervised Learning (SSL), typically relies on pairs of data views generated through augmentation. While multiple augmentations per instance (more than two) improve generalization in supervised learning, current CL methods handle additional views suboptimally by simply aggregating different pairwise objectives. This approach suffers from four critical limitations: (L1) it utilizes multiple optimization terms per data point, resulting in conflicting objectives; (L2) it fails to model all interactions across views and data points; (L3) it inherits fundamental limitations (e.g., alignment-uniformity coupling) from pairwise CL losses; and (L4) it prevents fully realizing the benefits of increased view multiplicity observed in supervised settings. We address these limitations through two novel loss functions: MV-InfoNCE, which extends InfoNCE to incorporate all possible view interactions simultaneously in one term per data point, and MV-DHEL, which decouples alignment from uniformity across views while scaling interaction complexity with view multiplicity. Both approaches are theoretically grounded: we prove they asymptotically optimize for alignment of all views and uniformity, providing principled extensions to multi-view contrastive learning. Our empirical results on ImageNet1K and three other datasets demonstrate that our methods consistently outperform existing multi-view approaches and effectively scale with increasing view multiplicity. We also apply our objectives to multimodal data and show that, in contrast to other contrastive objectives, they can scale beyond just two modalities. Most significantly, ablation studies reveal that MV-DHEL with five or more views effectively mitigates dimensionality collapse by fully utilizing the embedding space, thereby delivering the multi-view benefits observed in supervised learning.
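The exact MV-InfoNCE and MV-DHEL formulations are given in the paper; as a rough illustration of the core idea, the PyTorch sketch below contrasts every view of a data point against all other embeddings in a single log-softmax per anchor, instead of summing separate pairwise InfoNCE losses. The function name, the (N, V, D) input layout, and the multi-positive averaging are assumptions made for illustration, not the authors' code.

```python
import torch
import torch.nn.functional as F


def multiview_contrastive_sketch(embeddings: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Illustrative multi-view contrastive loss (not the paper's exact objective).

    embeddings: (N, V, D) tensor holding V >= 2 augmented views of N data points,
    already projected to D dimensions. All views of the same data point are treated
    as positives; every other embedding in the batch acts as a negative.
    """
    n, v, d = embeddings.shape
    z = F.normalize(embeddings, dim=-1).reshape(n * v, d)       # unit vectors, (N*V, D)

    sim = z @ z.t() / temperature                                # scaled cosine similarities
    self_mask = torch.eye(n * v, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))              # an anchor never matches itself

    # Rows belonging to the same data point (its other views) are positives.
    ids = torch.arange(n, device=z.device).repeat_interleave(v)
    pos_mask = (ids.unsqueeze(0) == ids.unsqueeze(1)) & ~self_mask

    # One log-softmax per anchor over all remaining embeddings (all views, all data points).
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)

    # Average the log-probability over all positives of each anchor, then over anchors.
    pos_log_prob = log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1) / pos_mask.sum(dim=1)
    return -pos_log_prob.mean()


# Example: 8 data points, 4 views each, 128-dimensional embeddings.
loss = multiview_contrastive_sketch(torch.randn(8, 4, 128))
```

The design choice illustrated here is that every anchor competes against all other views and data points inside one normalization term, rather than accumulating V*(V-1)/2 independent pairwise losses; per the abstract, MV-DHEL goes further by keeping the alignment and uniformity parts in separate terms.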