🤖 AI Summary
Multimodal models often underperform relative to unimodal baselines due to modality dominance—where one modality (e.g., video) overwhelmingly drives predictions at the expense of others (e.g., audio). To address this, we propose attribution regularization: a novel training paradigm that explicitly enforces balanced multimodal contribution by incorporating differentiable, interpretability-driven gradient-based attribution constraints—such as Integrated Gradients—directly into the end-to-end optimization objective. Unlike prior fusion architectures that implicitly favor dominant modalities, our method dynamically regularizes cross-modal importance via a differentiable penalty term, mitigating imbalance at the optimization level. Evaluated on multiple video-audio benchmark tasks, our approach achieves an average accuracy gain of 3.2% over the strongest unimodal baseline, while reducing inter-modal contribution variance by 67%. These results empirically confirm a strong correlation between attribution balance and performance improvement.
📝 Abstract
Multimodal machine learning has gained significant attention in recent years due to its potential for integrating information from multiple modalities to enhance learning and decision-making processes. However, it is commonly observed that unimodal models outperform multimodal models, despite the latter having access to richer information. Additionally, the influence of a single modality often dominates the decision-making process, resulting in suboptimal performance. This research project aims to address these challenges by proposing a novel regularization term that encourages multimodal models to effectively utilize information from all modalities when making decisions. The focus of this project lies in the video-audio domain, although the proposed regularization technique holds promise for broader applications in embodied AI research, where multiple modalities are involved. By leveraging this regularization term, the proposed approach aims to mitigate the issue of unimodal dominance and improve the performance of multimodal machine learning systems. Through extensive experimentation and evaluation, the effectiveness and generalizability of the proposed technique will be assessed. The findings of this research project have the potential to significantly contribute to the advancement of multimodal machine learning and facilitate its application in various domains, including multimedia analysis, human-computer interaction, and embodied AI research.