🤖 AI Summary
To address performance degradation caused by imbalanced learning across modalities in multimodal learning, this paper formulates modality imbalance as a multi-objective optimization problem for the first time and proposes an efficient gradient-based algorithm to solve it. Methodologically, we design a lightweight optimization framework with theoretical convergence guarantees that eliminates the costly subroutine calls inherent in conventional multi-objective methods, reducing subroutine computation time by up to ~20x. By integrating multi-objective optimization principles into multimodal training, our approach achieves a dynamic inter-modal learning balance without increasing model complexity. Extensive experiments on mainstream multimodal benchmarks, including MM-IMDB and CMU-MOSEI, demonstrate consistent gains over existing balanced-learning and multi-objective optimization baselines, validating both the effectiveness and the generalizability of the method.
📝 Abstract
Multi-modal learning (MML) aims to integrate information from multiple modalities and is expected to outperform single-modality learning. However, recent studies have shown that MML can underperform even single-modality approaches due to imbalanced learning across modalities. Existing methods alleviate this imbalance with various heuristics, which often require computationally intensive subroutines. In this paper, we reformulate MML as a multi-objective optimization (MOO) problem that overcomes the imbalanced learning issue among modalities, and we propose a gradient-based algorithm to solve the reformulated problem. We provide convergence guarantees for the proposed method and present empirical evaluations on popular MML benchmarks that showcase its improved performance over existing balanced MML and MOO baselines, with up to ~20x reduction in subroutine computation time. Our code is available at https://github.com/heshandevaka/MIMO.
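To make the MOO framing concrete, here is a minimal sketch of a classical two-objective gradient step: the closed-form min-norm convex combination of two per-modality gradients (the two-task MGDA update). This is a generic illustration of gradient-based MOO, not the paper's MIMO algorithm, and the toy quadratic "modality" losses and variable names (`g_audio`, `g_video`, targets `a`, `v`) are invented for the example.

```python
import numpy as np

def min_norm_direction(g1, g2):
    """Closed-form min-norm convex combination of two gradients
    (two-objective MGDA step): the direction gamma*g1 + (1-gamma)*g2
    of smallest norm, which is a common descent direction when nonzero."""
    diff = g1 - g2
    denom = diff @ diff
    if denom == 0.0:  # identical gradients: any combination works
        return g1.copy()
    # gamma minimizing ||gamma*g1 + (1-gamma)*g2||^2, clipped to [0, 1]
    gamma = np.clip(((g2 - g1) @ g2) / denom, 0.0, 1.0)
    return gamma * g1 + (1.0 - gamma) * g2

# Toy setup: two "modality" losses ||w - a||^2 and ||w - v||^2 sharing
# parameters w; their gradients conflict, so plain summing favors one side.
a, v = np.array([1.0, 0.0]), np.array([0.0, 1.0])
w = np.array([2.0, 2.0])
for _ in range(100):
    g_audio = 2.0 * (w - a)  # gradient of the first modality's loss
    g_video = 2.0 * (w - v)  # gradient of the second modality's loss
    w -= 0.1 * min_norm_direction(g_audio, g_video)
# w converges to a Pareto-stationary point on the segment between a and v
```

The key property shown here is that the combined step never sacrifices one modality's objective for the other, which is the balance that heuristic re-weighting schemes try to approximate; the paper's contribution is achieving this without the expensive per-step subroutines of conventional MOO solvers.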