Harnessing Shared Relations via Multimodal Mixup Contrastive Learning for Multimodal Classification

๐Ÿ“… 2024-09-26
๐Ÿ›๏ธ Trans. Mach. Learn. Res.
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Existing deep multimodal learning methods predominantly rely on pairwise modality contrast, limiting their capacity to model complex, higher-order cross-modal semantic sharing prevalent in real-world scenarios. To address this, we propose a Mixup-based contrastive loss for multimodal classification, enabling fine-grained semantic alignment via cross-modal sample mixing. We further design a joint training framework integrating a fusion module with unimodal auxiliary prediction tasks to strengthen shared representation learning. To our knowledge, this is the first work to incorporate Mixup into multimodal contrastive learningโ€”moving beyond conventional bimodal constraints and explicitly modeling higher-order modality collaboration. Extensive experiments demonstrate state-of-the-art performance on N24News, ROSMAP, and BRCA, and competitive results on Food-101, validating both strong cross-domain generalization and effective modeling of shared cross-modal semantics.

๐Ÿ“ Abstract
Deep multimodal learning has shown remarkable success by leveraging contrastive learning to capture explicit one-to-one relations across modalities. However, real-world data often exhibits shared relations beyond simple pairwise associations. We propose M3CoL, a Multimodal Mixup Contrastive Learning approach to capture nuanced shared relations inherent in multimodal data. Our key contribution is a Mixup-based contrastive loss that learns robust representations by aligning mixed samples from one modality with their corresponding samples from other modalities, thereby capturing shared relations between them. For multimodal classification tasks, we introduce a framework that integrates a fusion module with unimodal prediction modules for auxiliary supervision during training, complemented by our proposed Mixup-based contrastive loss. Through extensive experiments on diverse datasets (N24News, ROSMAP, BRCA, and Food-101), we demonstrate that M3CoL effectively captures shared multimodal relations and generalizes across domains. It outperforms state-of-the-art methods on N24News, ROSMAP, and BRCA, while achieving comparable performance on Food-101. Our work highlights the significance of learning shared relations for robust multimodal learning, opening up promising avenues for future research. Our code is publicly available at https://github.com/RaghavSinghal10/M3CoL.
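The exact loss is defined in the paper and code repository; as a rough, illustrative sketch of the core idea only (the function name, soft-target weighting, and all hyperparameters below are our assumptions, not the authors' implementation), a Mixup-based cross-modal contrastive loss could mix samples within one modality and pull the mixture toward both partners' counterparts in the other modality:

```python
import numpy as np

def mixup_contrastive_loss(img, txt, lam=0.7, temp=0.1):
    """Soft-target InfoNCE over mixed samples (illustrative sketch only).

    img, txt: (N, D) L2-normalised embeddings from two modalities.
    Each image i is mixed with a random partner j; the mixed embedding
    is pulled toward txt[i] with weight lam and toward txt[j] with
    weight 1 - lam, rather than toward a single positive.
    """
    n = img.shape[0]
    perm = np.random.permutation(n)                 # mix partners j
    mixed = lam * img + (1 - lam) * img[perm]
    mixed /= np.linalg.norm(mixed, axis=1, keepdims=True)

    logits = mixed @ txt.T / temp                   # (N, N) similarities
    m = logits.max(axis=1, keepdims=True)           # stable log-softmax
    log_p = logits - (m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True)))

    targets = lam * np.eye(n)                       # weight on own pair
    targets[np.arange(n), perm] += 1 - lam          # weight on partner's pair
    return -(targets * log_p).sum(axis=1).mean()    # soft cross-entropy
```

Because the target distribution spreads mass over two text embeddings, the loss rewards representations in which a convex combination of image samples stays close to the corresponding combination of their paired texts, which is one way to encode the shared (beyond pairwise) relations the abstract describes.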
Problem

Research questions and friction points this paper is trying to address.

Captures shared relations in multimodal data beyond pairwise associations
Learns robust representations via Mixup-based contrastive loss alignment
Improves multimodal classification by integrating fusion and unimodal modules
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Mixup Contrastive Learning captures shared relations
Mixup-based contrastive loss aligns mixed samples across modalities
Fusion module with auxiliary supervision enhances classification
Raja Kumar
Indian Institute of Technology Bombay, Mumbai, India
Raghav Singhal
Indian Institute of Technology Bombay, Mumbai, India
Pranamya Kulkarni
Indian Institute of Technology Bombay, Mumbai, India
Deval Mehta
Founding Member & Research Fellow at AIM for Health Lab | Monash University
Multi-modal AI for Healthcare · Foundation Models / LLMs · Health Equity and Responsible AI
Kshitij Jadhav
IIT Bombay
AIML in Healthcare