🤖 AI Summary
Multimodal large language models (MLLMs) suffer significant performance degradation on unimodal tasks, such as image classification and text-only question answering, due to interference from irrelevant modality signals, a phenomenon we term “modality interference” and treat as a measurable instance of cross-modal capability imbalance. To address it, we propose the first systematic diagnostic-and-mitigation framework: (1) an interpretable, causally grounded experimental protocol that quantifies modality interference via targeted modality perturbations; and (2) a joint optimization strategy integrating PGD-based adversarial data augmentation with cross-modal consistency regularization. Evaluated across diverse image, text, and VQA benchmarks, our approach substantially improves unimodal reasoning robustness and multimodal generalization while remaining compatible with MLLMs of different scales and architectures.
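The diagnostic protocol described above boils down to an accuracy-gap measurement: evaluate a unimodal task twice, once on clean inputs and once with a distractor injected into the task-irrelevant modality, and compare. A minimal sketch, where `model` and `perturb` are hypothetical placeholders rather than the paper's actual interface:

```python
def modality_interference_gap(model, dataset, perturb):
    """Causal diagnostic sketch: compare accuracy on clean unimodal inputs
    against the same inputs after a task-irrelevant-modality perturbation.

    model(image, text) -> predicted label; perturb(image, text) swaps in a
    distractor for the irrelevant modality. Both are hypothetical stand-ins.
    A gap of 0 means the model ignores the irrelevant modality entirely.
    """
    clean = sum(model(img, txt) == y for img, txt, y in dataset)
    pert = sum(model(*perturb(img, txt)) == y for img, txt, y in dataset)
    n = len(dataset)
    return clean / n - pert / n  # interference gap in accuracy points
```

A model that relies only on the relevant modality scores a gap of zero; any positive gap quantifies how much the irrelevant modality perturbs its predictions.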
📝 Abstract
Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities across tasks, yet they often struggle to distinguish task-relevant from irrelevant signals, particularly in tasks like Visual Question Answering (VQA), leaving them susceptible to misleading or spurious inputs. We refer to this broader limitation as the Cross-Modality Competency Problem: the model's inability to evaluate all modalities fairly. This vulnerability becomes most evident in modality-specific tasks such as image classification or pure-text question answering, where models are expected to rely on a single modality. In such tasks, spurious information from the irrelevant modality often causes significant performance degradation. We refer to this failure as Modality Interference, a concrete and measurable instance of the cross-modality competency problem, and design a perturbation-based causal diagnostic experiment to verify and quantify it. To mitigate modality interference, we propose a novel framework for fine-tuning MLLMs that combines perturbation-based data augmentation, using both heuristic perturbations and adversarial perturbations via Projected Gradient Descent (PGD), with a consistency regularization strategy applied to model outputs on original and perturbed inputs. Experiments on multiple benchmark datasets (image-heavy, text-heavy, and VQA tasks) and multiple model families at different scales demonstrate significant improvements in robustness and cross-modality competency, indicating that our method strengthens unimodal reasoning while also enhancing performance on multimodal tasks.
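The two training ingredients named in the abstract, PGD-based adversarial augmentation and output-consistency regularization, can be sketched on a toy linear classifier where the input gradient is analytic; this is an illustrative simplification, not the paper's MLLM-scale implementation, and all function names and hyperparameters here are assumptions:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def pgd_perturb(x, w, y, eps=0.1, alpha=0.02, steps=5):
    """PGD sketch: ascend the cross-entropy loss w.r.t. the input, projecting
    the perturbation back into an L-infinity ball of radius eps each step.
    Logits are the linear map x @ w, so dCE/dx is analytic."""
    delta = np.zeros_like(x)
    for _ in range(steps):
        p = softmax((x + delta) @ w)      # class probabilities, shape (C,)
        grad_logits = p.copy()
        grad_logits[y] -= 1.0             # dCE/dlogits = p - onehot(y)
        grad_x = w @ grad_logits          # chain rule through x @ w
        delta += alpha * np.sign(grad_x)  # signed ascent step
        delta = np.clip(delta, -eps, eps) # project into the eps-ball
    return x + delta

def consistency_loss(x, x_adv, w):
    """KL(p_clean || p_adv): penalizes prediction drift between the model's
    outputs on the original and the perturbed input."""
    p, q = softmax(x @ w), softmax(x_adv @ w)
    return float(np.sum(p * (np.log(p) - np.log(q))))
```

In the full method, the adversarial example would be generated in embedding space for the irrelevant modality and the consistency term added to the task loss; here the KL term simply illustrates how agreement between clean and perturbed predictions is enforced.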