Towards Uniformity and Alignment for Multimodal Representation Learning

📅 2026-02-10
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the modality gap that InfoNCE objectives induce in multimodal representation learning. The gap stems from two conflicts: one between inter-modal alignment and uniformity, and an intra-alignment conflict in which aligning several modalities at once produces competing alignment directions. The paper proposes a framework that decouples alignment and uniformity, employing Hölder divergence-based alignment optimization alongside a dedicated uniformity loss to mitigate these conflicts. Theoretically, the proposed objective is shown to act as an efficient proxy for the global Hölder divergence between modality distributions. The method requires no task-specific components and consistently improves performance on both discriminative tasks (e.g., retrieval) and generative tasks (e.g., UnCLIP-style generation).
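
To make the alignment-uniformity coupling concrete, the standard InfoNCE loss can be split into two terms. This is the textbook one-directional form for a single pair of modalities, not the paper's own derivation. The notation is assumed here for illustration: N paired embeddings, similarity s_ij between sample i of one modality and sample j of the other, and temperature τ.

```latex
\mathcal{L}_{\mathrm{InfoNCE}}
  = -\frac{1}{N}\sum_{i=1}^{N}
      \log\frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{N}\exp(s_{ij}/\tau)}
  = \underbrace{-\frac{1}{N}\sum_{i=1}^{N}\frac{s_{ii}}{\tau}}_{\text{alignment (attraction)}}
  \;+\;
  \underbrace{\frac{1}{N}\sum_{i=1}^{N}\log\sum_{j=1}^{N}\exp\!\left(\frac{s_{ij}}{\tau}\right)}_{\text{uniformity-like (repulsion)}}
```

The first term pulls matched pairs together, while the second pushes all cross-modal pairs apart, including the matched pair at j = i. Because both terms act on the same cross-modal similarities, minimizing one works against the other, which is one way to read the conflict described in the summary.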

📝 Abstract
Multimodal representation learning aims to construct a shared embedding space in which heterogeneous modalities are semantically aligned. Despite strong empirical results, InfoNCE-based objectives introduce inherent conflicts that yield distribution gaps across modalities. In this work, we identify two conflicts in the multimodal regime, both exacerbated as the number of modalities increases: (i) an alignment-uniformity conflict, whereby the repulsion of uniformity undermines pairwise alignment, and (ii) an intra-alignment conflict, where aligning multiple modalities induces competing alignment directions. To address these issues, we propose a principled decoupling of alignment and uniformity for multimodal representations, providing a conflict-free recipe for multimodal learning that simultaneously supports discriminative and generative use cases without task-specific modules. We then provide a theoretical guarantee that our method acts as an efficient proxy for a global Hölder divergence over multiple modality distributions, and thus reduces the distribution gap among modalities. Extensive experiments on retrieval and UnCLIP-style generation demonstrate consistent gains.
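
As a rough illustration of what decoupling alignment from uniformity can look like, the sketch below computes the two quantities as separate terms over several modalities. It is not the paper's objective: it swaps in the well-known squared-distance alignment and Gaussian-potential uniformity measures rather than a Hölder divergence. All names here (`alignment_loss`, `uniformity_loss`, `decoupled_loss`, `lam`, `t`) are hypothetical, and the function assumes at least two modalities with row-aligned samples.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Project each row onto the unit hypersphere."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def alignment_loss(za, zb):
    """Mean squared distance between paired rows of two modalities."""
    return float(np.mean(np.sum((za - zb) ** 2, axis=1)))


def uniformity_loss(z, t=2.0):
    """Log of the mean Gaussian potential over distinct pairs in ONE modality.

    Computed within a modality, so the repulsion never acts on a
    cross-modal positive pair.
    """
    sq_dists = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    upper = np.triu_indices(z.shape[0], k=1)  # each unordered pair once
    return float(np.log(np.mean(np.exp(-t * sq_dists[upper]))))


def decoupled_loss(embeddings, lam=1.0, t=2.0):
    """Average pairwise alignment plus weighted per-modality uniformity.

    embeddings: list of (N, d) arrays, one per modality (assumed >= 2),
    where row i of every array describes the same underlying sample.
    """
    zs = [l2_normalize(e) for e in embeddings]
    m = len(zs)
    align = [alignment_loss(zs[i], zs[j])
             for i in range(m) for j in range(i + 1, m)]
    unif = [uniformity_loss(z, t) for z in zs]
    return float(np.mean(align) + lam * np.mean(unif))


# Illustrative usage with random stand-ins for three modality encoders.
rng = np.random.default_rng(0)
image, text, audio = (rng.normal(size=(8, 16)) for _ in range(3))
print(decoupled_loss([image, text, audio]))
```

Keeping the repulsion term inside each modality is one simple way to stop it from pushing matched cross-modal pairs apart. How the paper itself structures the two terms is not specified in the abstract.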
Problem

Research questions and friction points this paper is trying to address.

multimodal representation learning
alignment-uniformity conflict
intra-alignment conflict
distribution gap
InfoNCE
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal representation learning
alignment-uniformity decoupling
distribution gap reduction
Hölder divergence
conflict-free multimodal learning