The More, the Merrier: Contrastive Fusion for Higher-Order Multimodal Alignment

📅 2025-11-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses the challenge of jointly modeling strong pairwise alignment and higher-order (e.g., XOR-type) inter-modal dependencies in multimodal joint representation learning. To this end, the authors propose ConFu, a contrastive fusion framework that jointly optimizes unimodal and fused multimodal representations within a unified embedding space. ConFu introduces a fused-modal contrastive loss that explicitly captures higher-order interactions and enables both one-to-one bidirectional and two-to-one cross-modal retrieval. By extending the contrastive learning objective and co-optimizing multimodal fusion encoders with the joint embedding space, ConFu performs competitively on synthetic and real-world benchmarks, including MM-IMDB and Clotho, across cross-modal retrieval and classification tasks, and scales with increasing multimodal complexity.
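A minimal sketch of how such a fused-modal contrastive objective could be wired up, assuming symmetric InfoNCE terms and a learned fusion module; the function names, the `fuse` interface, and the weighting `lam` are illustrative assumptions, not the authors' implementation:

```python
# Sketch of a ConFu-style objective: pairwise InfoNCE between modalities,
# plus a fused-modality term aligning fuse(A, B) with C.
# All names and the loss weighting are assumptions for illustration.
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two batches of embeddings of shape (batch, dim)."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature                    # (batch, batch) similarities
    targets = torch.arange(z1.size(0), device=z1.device)  # matched pairs on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

def confu_style_loss(za, zb, zc, fuse, lam: float = 1.0) -> torch.Tensor:
    """Pairwise alignment plus a fused-modality contrastive term.

    za, zb, zc: unimodal embeddings for modalities A, B, C.
    fuse:       a learned module mapping (za, zb) to a fused embedding
                in the same space (e.g., an MLP over the concatenation).
    """
    pairwise = info_nce(za, zb) + info_nce(za, zc) + info_nce(zb, zc)
    fused = info_nce(fuse(za, zb), zc)  # align the fused (A, B) pair with C
    return pairwise + lam * fused
```

Co-optimizing `fuse` with the shared embedding space is what lets the fused term carry interactions that no single pairwise term can express.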

📝 Abstract
Learning joint representations across multiple modalities remains a central challenge in multimodal machine learning. Prevailing approaches predominantly operate in pairwise settings, aligning two modalities at a time. While some recent methods aim to capture higher-order interactions among multiple modalities, they often overlook or insufficiently preserve pairwise relationships, limiting their effectiveness on single-modality tasks. In this work, we introduce Contrastive Fusion (ConFu), a framework that jointly embeds both individual modalities and their fused combinations into a unified representation space, where modalities and their fused counterparts are aligned. ConFu extends traditional pairwise contrastive objectives with an additional fused-modality contrastive term, encouraging the joint embedding of modality pairs with a third modality. This formulation enables ConFu to capture higher-order dependencies, such as XOR-like relationships, that cannot be recovered through pairwise alignment alone, while still maintaining strong pairwise correspondence. We evaluate ConFu on synthetic and real-world multimodal benchmarks, assessing its ability to exploit cross-modal complementarity, capture higher-order dependencies, and scale with increasing multimodal complexity. Across these settings, ConFu demonstrates competitive performance on retrieval and classification tasks, while supporting unified one-to-one and two-to-one retrieval within a single contrastive framework.
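To make the XOR-like dependency concrete, here is a toy construction (an illustrative assumption, not the paper's synthetic benchmark) in which the label is determined jointly by two binary modality signals while each signal alone is uninformative, so pairwise alignment with either modality cannot recover it:

```python
# Toy XOR-type dependency: y depends on the pair (a, b), not on either alone.
import numpy as np

rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=10_000)  # binary signal from "modality A"
b = rng.integers(0, 2, size=10_000)  # binary signal from "modality B"
y = a ^ b                            # label is the XOR of the two signals

# Each modality is (near-)uncorrelated with the label ...
print(np.corrcoef(a, y)[0, 1], np.corrcoef(b, y)[0, 1])  # both close to 0
# ... yet the pair determines the label exactly.
print(np.mean((a ^ b) == y))                             # 1.0
```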
Problem

Research questions and friction points this paper is trying to address.

Learning joint representations across multiple modalities simultaneously
Capturing higher-order interactions while preserving pairwise relationships
Enabling unified multimodal alignment beyond pairwise contrastive learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Contrastive Fusion framework for multimodal alignment
Extends pairwise contrastive objectives with a fused-modality term
Captures higher-order dependencies while maintaining pairwise correspondence, supporting unified one-to-one and two-to-one retrieval (see the sketch after this list)
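Because unimodal and fused embeddings share one space, two-to-one retrieval reduces to nearest-neighbour search with a fused query. A brief sketch under the same assumptions as the loss example above (the `fuse` module and embedding shapes are hypothetical):

```python
# Hypothetical two-to-one retrieval: query with a fused (A, B) embedding
# against a gallery of modality-C embeddings in the shared space.
import torch
import torch.nn.functional as F

def two_to_one_retrieve(za, zb, gallery_c, fuse, k: int = 5) -> torch.Tensor:
    """Return top-k gallery indices for each fused (A, B) query."""
    query = F.normalize(fuse(za, zb), dim=-1)  # (n_queries, dim)
    gallery = F.normalize(gallery_c, dim=-1)   # (n_gallery, dim)
    scores = query @ gallery.t()               # cosine similarities
    return scores.topk(k, dim=-1).indices      # (n_queries, k)
```

One-to-one retrieval works the same way with a unimodal query, which is what keeps both modes inside a single contrastive framework.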