Aligning Multimodal Representations through an Information Bottleneck

📅 2025-06-05

📈 Citations: 0

✨ Influential: 0

career value

221K/year

🤖 AI Summary

Multimodal contrastive learning, widely adopted for representation alignment, often fails to achieve semantic consistency because standard contrastive losses maximize mutual information without suppressing modality-specific information. Method: We introduce the information bottleneck principle into multimodal alignment for the first time, proposing a differentiable variational regularizer that explicitly enforces modality-invariant representations and suppresses modality-specific features within the contrastive learning framework. Contribution/Results: Our method requires no additional annotations and significantly improves alignment accuracy and semantic consistency in controlled ablation studies and cross-modal retrieval tasks. Empirical results demonstrate both the effectiveness and generalizability of information-bottleneck-driven regularization for multimodal representation learning.

Technology Category

Application Category

📝 Abstract

Contrastive losses have been extensively used as a tool for multimodal representation learning. However, it has been empirically observed that their use is not effective to learn an aligned representation space. In this paper, we argue that this phenomenon is caused by the presence of modality-specific information in the representation space. Although some of the most widely used contrastive losses maximize the mutual information between representations of both modalities, they are not designed to remove the modality-specific information. We give a theoretical description of this problem through the lens of the Information Bottleneck Principle. We also empirically analyze how different hyperparameters affect the emergence of this phenomenon in a controlled experimental setup. Finally, we propose a regularization term in the loss function that is derived by means of a variational approximation and aims to increase the representational alignment. We analyze in a set of controlled experiments and real-world applications the advantages of including this regularization term.

Problem

Research questions and friction points this paper is trying to address.

Addresses ineffective alignment in multimodal representation learning

Identifies modality-specific information as a key alignment obstacle

Proposes regularization to enhance representational alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Information Bottleneck Principle for alignment

Introduces variational regularization in loss function

Removes modality-specific information effectively

🔎 Similar Papers

What to align in multimodal contrastive learning?