Harnessing Shared Relations via Multimodal Mixup Contrastive Learning for Multimodal Classification

๐Ÿ“… 2024-09-26
๐Ÿ›๏ธ Trans. Mach. Learn. Res.
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Existing deep multimodal learning methods predominantly rely on pairwise modality contrast, limiting their capacity to model complex, higher-order cross-modal semantic sharing prevalent in real-world scenarios. To address this, we propose a Mixup-based contrastive loss for multimodal classification, enabling fine-grained semantic alignment via cross-modal sample mixing. We further design a joint training framework integrating a fusion module with unimodal auxiliary prediction tasks to strengthen shared representation learning. To our knowledge, this is the first work to incorporate Mixup into multimodal contrastive learningโ€”moving beyond conventional bimodal constraints and explicitly modeling higher-order modality collaboration. Extensive experiments demonstrate state-of-the-art performance on N24News, ROSMAP, and BRCA, and competitive results on Food-101, validating both strong cross-domain generalization and effective modeling of shared cross-modal semantics.

๐Ÿ“ Abstract
Deep multimodal learning has shown remarkable success by leveraging contrastive learning to capture explicit one-to-one relations across modalities. However, real-world data often exhibits shared relations beyond simple pairwise associations. We propose M3CoL, a Multimodal Mixup Contrastive Learning approach to capture nuanced shared relations inherent in multimodal data. Our key contribution is a Mixup-based contrastive loss that learns robust representations by aligning mixed samples from one modality with their corresponding samples from other modalities, thereby capturing shared relations between them. For multimodal classification tasks, we introduce a framework that integrates a fusion module with unimodal prediction modules for auxiliary supervision during training, complemented by our proposed Mixup-based contrastive loss. Through extensive experiments on diverse datasets (N24News, ROSMAP, BRCA, and Food-101), we demonstrate that M3CoL effectively captures shared multimodal relations and generalizes across domains. It outperforms state-of-the-art methods on N24News, ROSMAP, and BRCA, while achieving comparable performance on Food-101. Our work highlights the significance of learning shared relations for robust multimodal learning, opening up promising avenues for future research. Our code is publicly available at https://github.com/RaghavSinghal10/M3CoL.
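The exact loss is defined in the paper and code repository; as a rough, illustrative sketch of the core idea only (the function name, soft-target weighting, and all hyperparameters below are our assumptions, not the authors' implementation), a Mixup-based cross-modal contrastive loss could mix samples within one modality and pull the mixture toward both partners' counterparts in the other modality:

```python
import numpy as np

def mixup_contrastive_loss(img, txt, lam=0.7, temp=0.1):
    """Soft-target InfoNCE over mixed samples (illustrative sketch only).

    img, txt: (N, D) L2-normalised embeddings from two modalities.
    Each image i is mixed with a random partner j; the mixed embedding
    is pulled toward txt[i] with weight lam and toward txt[j] with
    weight 1 - lam, rather than toward a single positive.
    """
    n = img.shape[0]
    perm = np.random.permutation(n)                 # mix partners j
    mixed = lam * img + (1 - lam) * img[perm]
    mixed /= np.linalg.norm(mixed, axis=1, keepdims=True)

    logits = mixed @ txt.T / temp                   # (N, N) similarities
    m = logits.max(axis=1, keepdims=True)           # stable log-softmax
    log_p = logits - (m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True)))

    targets = lam * np.eye(n)                       # weight on own pair
    targets[np.arange(n), perm] += 1 - lam          # weight on partner's pair
    return -(targets * log_p).sum(axis=1).mean()    # soft cross-entropy
```

Because the target distribution spreads mass over two text embeddings, the loss rewards representations in which a convex combination of image samples stays close to the corresponding combination of their paired texts, which is one way to encode the shared (beyond pairwise) relations the abstract describes.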
Problem

Research questions and friction points this paper is trying to address.

Captures shared relations in multimodal data beyond pairwise associations
Learns robust representations via Mixup-based contrastive loss alignment
Improves multimodal classification by integrating fusion and unimodal modules
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Mixup Contrastive Learning captures shared relations
Mixup-based contrastive loss aligns mixed samples across modalities
Fusion module with auxiliary supervision enhances classification
Raja Kumar
Indian Institute of Technology Bombay, Mumbai, India
Raghav Singhal
Indian Institute of Technology Bombay, Mumbai, India
Pranamya Kulkarni
Indian Institute of Technology Bombay, Mumbai, India
Deval Mehta
Founding Member & Research Fellow at AIM for Health Lab | Monash University
Multi-modal AI for Healthcare · Foundation Models / LLMs · Health Equity and Responsible AI
Kshitij Jadhav
IIT Bombay
AIML in Healthcare