🤖 AI Summary
In multimodal learning, models often over-rely on dominant modalities, leading to underutilization of weak modalities and degraded generalization. To address this modality imbalance, we propose a semantic-inconsistency-driven data augmentation framework. First, cross-modal misaligned samples are generated based on unimodal confidence scores. Second, a dynamic weighting mechanism jointly optimizes both the contribution weights of weak modalities and the sampling weights for hard examples. Third, a feature-similarity-guided hard-example prioritization strategy is introduced to enhance discriminative learning. The method requires no additional annotations and effectively mitigates modality bias. It significantly improves model robustness against ambiguous or noisy inputs. Evaluated on major multimodal classification benchmarks—including MM-IMDB and CMU-MOSEI—our approach achieves state-of-the-art performance, demonstrating its effectiveness in balancing modality contributions and strengthening weak-modality representations.
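The first step above — generating cross-modal misaligned samples and labeling them from unimodal confidence scores — can be sketched roughly as follows. This is an illustrative reconstruction, not the paper's exact procedure: the function name `make_misaligned`, the permutation-based pairing, and the relative-confidence soft label are all assumptions made for the sketch.

```python
import torch

def make_misaligned(x_a, x_b, conf_a, conf_b, y):
    """Illustrative sketch of misaligned-sample generation (assumed form).

    x_a, x_b : features of modalities A and B, shape (N, D)
    conf_a, conf_b : unimodal confidence scores, shape (N,)
    y : one-hot labels, shape (N, C)
    """
    # Pair modality A of sample i with modality B of a different sample,
    # producing semantically inconsistent cross-modal inputs.
    perm = torch.randperm(x_a.size(0))
    x_b_mis = x_b[perm]
    # Soft label: weight each source label by its modality's relative
    # confidence, so no extra annotations are needed (assumed labeling rule).
    w = conf_a / (conf_a + conf_b[perm] + 1e-8)
    y_mis = w.unsqueeze(1) * y + (1.0 - w).unsqueeze(1) * y[perm]
    return x_a, x_b_mis, y_mis
```

Because the soft label is a convex combination of two one-hot labels, each row still sums to one and can be used directly with a cross-entropy-style loss.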
📝 Abstract
Multimodal models often over-rely on dominant modalities, failing to achieve optimal performance. While prior work focuses on modifying training objectives or optimization procedures, data-centric solutions remain underexplored. We propose MIDAS, a novel data augmentation strategy that generates misaligned samples with semantically inconsistent cross-modal information, labeled using unimodal confidence scores to compel learning from contradictory signals. However, this confidence-based labeling can still favor the more confident modality. To address this within our misaligned samples, we introduce weak-modality weighting, which dynamically increases the loss weight of the least confident modality, helping the model fully exploit the weaker modality. Furthermore, misaligned samples whose features closely resemble the aligned features are more challenging, and training on them enables the model to better distinguish between classes. To leverage this, we propose hard-sample weighting, which prioritizes such semantically ambiguous misaligned samples. Experiments on multiple multimodal classification benchmarks demonstrate that MIDAS significantly outperforms related baselines in addressing modality imbalance.
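The two weighting schemes in the abstract can be sketched as below. This is a hedged approximation under stated assumptions: the function name, the temperature `tau`, the use of max softmax probability as the confidence score, and the sigmoid-of-similarity hardness weight are all choices made for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def modality_and_hardness_weights(logits_a, logits_b, feat_mis, feat_aligned, tau=0.5):
    """Illustrative weak-modality and hard-sample weights (assumed form).

    logits_a, logits_b : unimodal classifier logits, shape (N, C)
    feat_mis, feat_aligned : fused features of misaligned / aligned samples, (N, D)
    """
    # Confidence per modality: max softmax probability (assumed score).
    conf_a = F.softmax(logits_a, dim=1).amax(dim=1)
    conf_b = F.softmax(logits_b, dim=1).amax(dim=1)
    conf = torch.stack([conf_a, conf_b], dim=1)  # (N, 2)
    # Weak-modality weighting: the *less* confident modality gets the
    # larger loss weight, so the model is pushed to exploit it.
    weak_w = F.softmax(-conf / tau, dim=1)
    # Hard-sample weighting: misaligned features that are *more* similar
    # to the aligned features are harder, so they get a larger weight.
    sim = F.cosine_similarity(feat_mis, feat_aligned, dim=1)
    hard_w = torch.sigmoid(sim / tau)
    return weak_w, hard_w
```

In use, `weak_w` would rescale the per-modality loss terms on misaligned samples and `hard_w` would rescale the per-sample loss, so both biases are corrected without any extra annotations.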