Rethinking the Text-Vision Reasoning Imbalance in MLLMs through the Lens of Training Recipes

📅 2025-10-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses the “modality gap” in multimodal large language models (MLLMs): an imbalance between visual and textual reasoning in which models over-rely on linguistic cues while underutilizing visual information, degrading performance on tasks that demand genuine visual reasoning. The authors first show how prevailing training strategies, particularly data construction heuristics and loss function design, exacerbate this gap. To mitigate it, they propose a dual-path framework: (1) data sampling guided by contrastive learning and designed to enhance visual perception, and (2) a vision alignment loss that explicitly enforces cross-modal representation consistency. Integrated with multimodal instruction tuning, the approach substantially narrows the performance disparity between text-dominated and vision-dominated evaluation benchmarks, with consistent improvements across multiple VQA and image captioning benchmarks. The framework is reproducible and offers a principled, training-centric path toward balanced multimodal reasoning.
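The vision alignment loss is only described at a high level here; the following is a minimal sketch of what such a cross-modal consistency term could look like, assuming pooled per-example vision and text representations projected into a shared space. The function name, pooling, and weighting are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def vision_alignment_loss(vision_feats: torch.Tensor,
                          text_feats: torch.Tensor) -> torch.Tensor:
    """Hypothetical cross-modal consistency loss (not the paper's exact form).

    vision_feats: (batch, dim) pooled visual representations
    text_feats:   (batch, dim) pooled textual representations
    Pushes paired representations toward cosine similarity 1 so the model
    cannot satisfy the objective while ignoring the image.
    """
    v = F.normalize(vision_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    return (1.0 - (v * t).sum(dim=-1)).mean()

# Illustrative use alongside the standard instruction-tuning objective:
# total_loss = lm_loss + align_weight * vision_alignment_loss(v_pooled, t_pooled)
```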

📝 Abstract
Multimodal large language models (MLLMs) have demonstrated strong capabilities on vision-and-language tasks. However, recent findings reveal an imbalance in their reasoning capabilities across visual and textual modalities. Specifically, current MLLMs often over-rely on textual cues while under-attending to visual content, resulting in suboptimal performance on tasks that require genuine visual reasoning. We refer to this phenomenon as the “modality gap,” defined as the performance disparity between text-centric and vision-centric inputs. In this paper, we analyze the modality gap through the lens of training recipes. We first show that existing training recipes tend to amplify this gap. Then, we systematically explore strategies to bridge it from two complementary perspectives: data and loss design. Our findings provide insights into developing training recipes that mitigate the modality gap and promote more balanced multimodal reasoning. Our code is publicly available at https://github.com/UCSB-NLP-Chang/Bridging-Modality-Gap.
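The abstract defines the modality gap as the performance disparity between text-centric and vision-centric inputs. As a rough illustration of how such a gap could be quantified (the benchmark grouping and averaging below are assumptions, not the paper's exact evaluation protocol):

```python
def modality_gap(text_centric_scores: dict[str, float],
                 vision_centric_scores: dict[str, float]) -> float:
    """Illustrative gap metric: mean score on text-centric benchmarks minus
    mean score on vision-centric ones. A large positive value suggests the
    model leans on textual cues and under-uses visual content."""
    text_avg = sum(text_centric_scores.values()) / len(text_centric_scores)
    vision_avg = sum(vision_centric_scores.values()) / len(vision_centric_scores)
    return text_avg - vision_avg

# Hypothetical usage with placeholder benchmark names and scores:
# gap = modality_gap({"text_heavy_vqa": 0.82}, {"vision_heavy_vqa": 0.61})  # 0.21
```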
Problem

Research questions and friction points this paper is trying to address.

Analyzing the text-vision reasoning imbalance in multimodal language models
Investigating training recipes that amplify visual-textual modality gaps
Exploring data and loss strategies to bridge multimodal reasoning disparities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzing modality gap through training recipes
Mitigating imbalance via data and loss design (an illustrative data-sampling sketch follows this list)
Developing balanced multimodal reasoning strategies
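As referenced in the list above, the data side of the design is contrastive-learning-guided, visual-perception-enhanced sampling (see the AI summary). Below is a minimal sketch under the assumption that a CLIP-style dual encoder scores how strongly each training example depends on its image; the scoring rule and retention fraction are illustrative, not the paper's procedure.

```python
import torch
import torch.nn.functional as F

def sample_vision_dependent_examples(image_embs: torch.Tensor,
                                     text_embs: torch.Tensor,
                                     keep_fraction: float = 0.5) -> torch.Tensor:
    """Illustrative contrastive sampling: rank (image, text) training pairs by
    image-text similarity and keep the most vision-dependent fraction.
    Embeddings are assumed to come from a CLIP-style dual encoder."""
    img = F.normalize(image_embs, dim=-1)
    txt = F.normalize(text_embs, dim=-1)
    sims = (img * txt).sum(dim=-1)                 # per-example cosine similarity
    k = max(1, int(keep_fraction * sims.numel()))
    return torch.topk(sims, k).indices             # indices of retained examples
```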
Authors

Guanyu Yao (UC Santa Barbara)
Qiucheng Wu (UC Santa Barbara)
Yang Zhang (MIT-IBM Watson AI Lab)
Zhaowen Wang (Adobe Research)
Handong Zhao (Adobe Research)
Shiyu Chang (University of California, Santa Barbara)
Machine Learning · Natural Language Processing · Computer Vision