See-Saw Modality Balance: See Gradient, and Sew Impaired Vision-Language Balance to Mitigate Dominant Modality Bias

📅 2025-03-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Vision-language (VL) models suffer from “dominant modality bias,” where gradients from one modality—vision or language—overwhelm the optimization process during training, leading to strong modality dependence and poor robustness when a modality is impaired. This work is the first to formalize modality imbalance from a **gradient alignment perspective**, proposing **inter-modality gradient reweighting** and **inter-task gradient projection** to enforce balanced, modality-coordinated convergence at the loss level. The resulting framework, **BalGrad**, combines dynamic gradient reweighting, KL-divergence–based adaptive regularization, and joint vision-language fine-tuning. Evaluated on UPMC Food-101, Hateful Memes, and MM-IMDb, it significantly alleviates modality dependence: under modality corruption, average accuracy improves by 4.2–7.8%. The approach establishes an interpretable, scalable gradient-coordination paradigm for robust multimodal learning.

📝 Abstract
Vision-language (VL) models have demonstrated strong performance across various tasks. However, these models often rely on a specific modality for predictions, leading to “dominant modality bias.” This bias significantly hurts performance, especially when one modality is impaired. In this study, we analyze model behavior under dominant modality bias and theoretically show that unaligned gradients or differences in gradient magnitudes prevent balanced convergence of the loss. Based on these findings, we propose a novel framework, BalGrad, to mitigate dominant modality bias. Our approach includes inter-modality gradient reweighting, which adjusts the gradient of the KL divergence based on each modality's contribution, and inter-task gradient projection, which aligns task directions in a non-conflicting manner. Experiments on the UPMC Food-101, Hateful Memes, and MM-IMDb datasets confirm that BalGrad effectively alleviates over-reliance on specific modalities when making predictions.
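The inter-modality reweighting idea from the abstract can be sketched numerically: when one modality's gradient is much larger than the other's, scale each gradient by the other modality's relative magnitude so the weaker modality is not drowned out. This is a minimal illustrative sketch, not the paper's exact weighting rule; the function name and the inverse-contribution weights are assumptions.

```python
import numpy as np

def reweight_gradients(g_img, g_txt, eps=1e-12):
    """Scale each modality's gradient by the *other* modality's
    relative magnitude, so the weaker (smaller-gradient) modality
    is not drowned out during joint optimization.
    Illustrative only; not BalGrad's exact formulation."""
    n_img = np.linalg.norm(g_img)
    n_txt = np.linalg.norm(g_txt)
    total = n_img + n_txt + eps
    w_img = n_txt / total   # small when the image gradient dominates
    w_txt = n_img / total   # small when the text gradient dominates
    return w_img * g_img, w_txt * g_txt

g_img = np.array([4.0, 0.0])   # dominant modality
g_txt = np.array([0.0, 1.0])   # weaker modality
gi, gt = reweight_gradients(g_img, g_txt)
print(np.linalg.norm(gi), np.linalg.norm(gt))  # norms equalize
```

With this particular weighting the rescaled gradient norms become equal, which is the balanced-convergence intuition the abstract describes.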
Problem

Research questions and friction points this paper is trying to address.

Mitigate dominant modality bias in vision-language models.
Address unbalanced convergence due to unaligned gradients.
Reduce over-reliance on a specific modality, especially when the other is impaired.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Inter-modality gradient reweighting for balanced modality contributions.
Adjust the KL-divergence gradient per modality's contribution.
Inter-task gradient projection for non-conflicting alignment.
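The last innovation, inter-task gradient projection, can be sketched with plain vector math: when two task gradients conflict (negative dot product), remove the conflicting component of one gradient so the update no longer opposes the other task. This is a generic PCGrad-style projection given as an assumption; the paper's exact projection rule may differ.

```python
import numpy as np

def project_conflicting(g_task, g_ref, eps=1e-12):
    """If g_task conflicts with g_ref (negative dot product),
    subtract its projection onto g_ref so the remaining update
    no longer opposes the reference task's direction."""
    dot = float(np.dot(g_task, g_ref))
    if dot < 0:
        g_task = g_task - dot / (np.dot(g_ref, g_ref) + eps) * g_ref
    return g_task

# Conflicting task gradients: their dot product is negative.
g_vision = np.array([1.0, 0.0])
g_lang = np.array([-1.0, 1.0])
g_proj = project_conflicting(g_lang, g_vision)
print(np.dot(g_proj, g_vision))  # no longer negative
```

After projection the language-task gradient is orthogonal to the vision-task gradient, so applying it cannot undo progress on the vision task; this is the "non-conflicting alignment" intuition.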