🤖 AI Summary
To address modality imbalance in multimodal learning caused by uneven single-modality data sampling, this paper proposes a data-aware dynamic unimodal sampling framework. Methodologically, it introduces (1) Cumulative Modality Discrepancy (CMD), a differentiable, monitorable metric for quantifying modality imbalance during training; (2) two adaptive sampling strategies, one heuristic and one based on Proximal Policy Optimization (PPO) reinforcement learning, that determine how much data each modality contributes per iteration; and (3) a plug-and-play module that integrates seamlessly with mainstream multimodal architectures. Evaluated on multiple benchmark datasets, the framework achieves an average accuracy improvement of 2.3% over state-of-the-art methods, demonstrating that regulating modality balance at the data sampling stage is critical to enhancing model performance.
📝 Abstract
To address the modality learning degeneration caused by modality imbalance, existing multimodal learning (MML) approaches primarily attempt to balance the optimization process of each modality from the perspective of model learning. However, almost all existing methods ignore the modality imbalance caused by unimodal data sampling: sampling equal amounts of data from each modality often yields discrepancies in informational content, which in turn leads to modality imbalance. Therefore, in this paper we propose a novel MML approach called **D**ata-aware **U**nimodal **S**ampling (DUS), which dynamically alleviates the modality imbalance caused by sampling. Specifically, we first propose a novel cumulative modality discrepancy to monitor the multimodal learning process. Based on this learning status, we propose heuristic and reinforcement learning (RL)-based data-aware unimodal sampling approaches that adaptively determine the quantity of data sampled from each modality at each iteration, thereby alleviating modality imbalance from the sampling perspective. Moreover, our method can be seamlessly incorporated into almost all existing multimodal learning approaches as a plugin. Experiments demonstrate that DUS achieves the best performance compared with diverse state-of-the-art (SOTA) baselines.
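To make the sampling idea concrete, here is a minimal sketch of the heuristic variant under stated assumptions: the exact form of the cumulative modality discrepancy and the batch-splitting rule below (`update_cmd`, `sample_quantities`, the `decay` and `sensitivity` parameters) are hypothetical simplifications, not the paper's actual formulation. The sketch accumulates the gap between two modalities' losses and allocates a larger share of the next batch to the lagging modality.

```python
import math

def update_cmd(cmd, loss_a, loss_b, decay=0.9):
    """Exponentially accumulate the gap between modality losses.

    A positive CMD means modality A is lagging (higher loss).
    This EMA form is an assumed simplification of the paper's
    cumulative modality discrepancy.
    """
    return decay * cmd + (1 - decay) * (loss_a - loss_b)

def sample_quantities(cmd, batch_size, sensitivity=2.0):
    """Heuristically split a batch between two modalities.

    The lagging modality (per CMD's sign) receives more samples;
    a sigmoid keeps the split strictly inside (0, 1).
    """
    frac_a = 1.0 / (1.0 + math.exp(-sensitivity * cmd))
    n_a = max(1, round(frac_a * batch_size))
    return n_a, max(1, batch_size - n_a)
```

With a balanced history (`cmd == 0`) the batch splits evenly; if modality A's loss has been persistently higher, `cmd` grows positive and A receives the larger share, which is the sense in which the sampler is "data-aware".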