AI Summary
Existing approaches struggle to model high-order dependencies among more than two modalities and lack a unified principle for balancing information retention and compression. This work introduces the information bottleneck principle into arbitrary multimodal alignment for the first time, proposing a One-vs-All multimodal alignment framework. By optimizing each modality's sufficiency and minimality with respect to all others, the method derives a computable contrastive lower bound and a minimality regularizer. It further integrates parameter-free geometry-aware projection and a distribution-dependent upper-bound regularizer to effectively capture high-order interactions and geometric structures. The proposed approach achieves consistently strong and state-of-the-art performance across diverse tasks, including classification, regression, modality-agnostic evaluation, and cross-modal retrieval.
Abstract
Contrastive learning is effective for aligning paired views or modalities, but alignment beyond two modalities remains non-trivial and comparatively underexplored. Pairwise CLIP-style losses decompose multi-modal alignment into independent two-way comparisons and therefore do not explicitly model higher-order dependencies among multiple modalities. Recent beyond-pairwise objectives approach this problem from statistical or geometric perspectives, but arbitrary-modality alignment still lacks a principled criterion for defining what each modality should preserve and compress relative to the others. We revisit arbitrary-modality alignment through the Information Bottleneck principle. In multi-modal learning, sufficiency should preserve information predictable from the remaining modalities, while minimality should compress modality-specific information not supported by them. This naturally leads to a One-vs-All view, where each modality is characterized with respect to the remaining modalities. We propose OVA-IB, an Information Bottleneck framework for arbitrary-modality alignment. OVA-IB optimizes a tractable One-vs-All contrastive lower bound for sufficiency connected to a Dual Total Correlation-style objective, uses a parameter-free geometry-aware projection score, and derives a tractable upper-bound regularizer for minimality by bounding each representation's dependence on its own input with representation distributions induced by the remaining modalities. Experiments on classification, regression, modality-agnostic evaluation, and cross-modal retrieval benchmarks demonstrate strong and robust performance.
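To make the One-vs-All idea concrete, the following is a minimal sketch, not the authors' OVA-IB objective. It assumes that the "remaining modalities" are summarized by the mean of their normalized embeddings and that the score is a plain cosine similarity; both are placeholders for the paper's geometry-aware projection score, which the abstract does not specify. The minimality regularizer is omitted. The function name `ova_infonce` and the `temperature` parameter are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit Euclidean norm."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def log_softmax_rows(logits):
    """Numerically stable log-softmax over each row."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def ova_infonce(embeddings, temperature=0.1):
    """One-vs-All InfoNCE sketch (hypothetical helper, not the paper's loss).

    embeddings: list of M arrays of shape (N, D); row n of every array
    belongs to the same sample. Returns the loss averaged over modalities.
    """
    z = [l2_normalize(np.asarray(e, dtype=float)) for e in embeddings]
    m = len(z)
    if m < 2:
        raise ValueError("need at least two modalities")
    total = sum(z)
    losses = []
    for i in range(m):
        # Assumption: the context for modality i is the mean of the others.
        context = l2_normalize((total - z[i]) / (m - 1))
        # Cosine-similarity scores between every anchor and every context.
        logits = z[i] @ context.T / temperature  # shape (N, N)
        # Positives sit on the diagonal; other rows in the batch are negatives.
        log_probs = log_softmax_rows(logits)
        losses.append(-np.mean(np.diag(log_probs)))
    return float(np.mean(losses))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = rng.normal(size=(32, 16))
    # Three synthetic "modalities" that are noisy copies of the same signal.
    views = [base + 0.01 * rng.normal(size=base.shape) for _ in range(3)]
    print("aligned loss:", ova_infonce(views))
```

Under the standard InfoNCE argument, minimizing each per-modality term maximizes a lower bound on the mutual information between that modality's representation and the aggregated context, which is the role the abstract assigns to the sufficiency term. Unlike a sum of pairwise CLIP-style losses, each term here scores one modality against all the others jointly.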