VISTA: Enhancing Vision-Text Alignment in MLLMs via Cross-Modal Mutual Information Maximization

📅 2025-05-16
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Multimodal large language models (MLLMs) suffer from a pervasive bias toward text, which weakens vision–language alignment and becomes especially severe as text sequences grow long. To address this, we analyze the implicit misalignment inherent in the cross-entropy loss from an information-theoretic perspective and propose an explicit alignment objective grounded in mutual information maximization. The resulting paradigm is lightweight and parameter-free: it freezes the backbone network, requires no additional training data, and introduces only a theoretically motivated loss term, enabling plug-and-play enhancement of visual understanding. Evaluated on more than ten benchmarks, including VQAv2, MMStar, and MME, our method consistently outperforms strong baselines, substantially improving cross-modal fusion quality and visual reasoning while generalizing robustly without architectural modification or extra supervision.
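
For illustration, the sketch below shows one way an explicit mutual-information alignment term could be added on top of the usual autoregressive cross-entropy loss. The InfoNCE-style bound, the pooled features, and the weight `lam` are assumptions of ours; the summary does not specify the exact form of VISTA's objective.

```python
# Hypothetical sketch: an InfoNCE-style mutual-information lower bound used as
# an explicit vision-text alignment term, added to the standard cross-entropy
# loss. This is an illustrative stand-in, not VISTA's actual objective.
import torch
import torch.nn.functional as F

def info_nce_alignment(vis_feat: torch.Tensor, txt_feat: torch.Tensor,
                       temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE lower bound on I(vision; text) over a batch of pooled features.

    vis_feat, txt_feat: (batch, dim) pooled visual / textual representations.
    """
    v = F.normalize(vis_feat, dim=-1)
    t = F.normalize(txt_feat, dim=-1)
    logits = v @ t.T / temperature                      # (batch, batch) similarities
    targets = torch.arange(v.size(0), device=v.device)  # matched pairs on the diagonal
    # Symmetric contrastive loss over vision-to-text and text-to-vision directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

def total_loss(ce_loss: torch.Tensor, vis_feat: torch.Tensor,
               txt_feat: torch.Tensor, lam: float = 0.1) -> torch.Tensor:
    # Plug-and-play: the alignment term is simply added to the existing
    # autoregressive cross-entropy loss; lam is a hypothetical weighting.
    return ce_loss + lam * info_nce_alignment(vis_feat, txt_feat)
```

Because the extra term reuses representations the model already produces, a combined loss of this shape would introduce no new trainable modules, matching the plug-and-play claim above.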

📝 Abstract
Current multimodal large language models (MLLMs) face a critical challenge in modality alignment, often exhibiting a bias towards textual information at the expense of other modalities like vision. This paper conducts a systematic information-theoretic analysis of the widely used cross-entropy loss in MLLMs, uncovering its implicit alignment objective. Our theoretical investigation reveals that this implicit objective has inherent limitations, leading to a degradation of cross-modal alignment as text sequence length increases, thereby hindering effective multimodal information fusion. To overcome these drawbacks, we propose Vision-Text Alignment (VISTA), a novel approach guided by our theoretical insights. VISTA introduces an explicit alignment objective designed to maximize cross-modal mutual information, preventing the degradation of visual alignment. Notably, VISTA enhances the visual understanding capabilities of existing MLLMs without requiring any additional trainable modules or extra training data, making it both efficient and practical. Our method significantly outperforms baseline models across more than a dozen benchmark datasets, including VQAv2, MMStar, and MME, paving the way for new directions in MLLM modal alignment research.
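
As a rough illustration of the information-theoretic framing described above (the notation v for the visual input and y_{1:T} for the text tokens is ours, not the paper's exact derivation), the sketch below shows why the visual contribution to the cross-entropy objective can shrink as the text sequence grows.

```latex
% Rough sketch of the information-theoretic framing; notation is ours,
% not the paper's exact derivation.
\documentclass{article}
\usepackage{amsmath}
\begin{document}
\begin{align}
  \mathcal{L}_{\mathrm{CE}}
    &= -\sum_{t=1}^{T} \log p_\theta\!\left(y_t \mid y_{<t},\, v\right),\\
  I\!\left(v;\, y_t \mid y_{<t}\right)
    &= H\!\left(y_t \mid y_{<t}\right) - H\!\left(y_t \mid y_{<t},\, v\right).
\end{align}
% As $T$ grows, the text context $y_{<t}$ alone predicts $y_t$ increasingly
% well, so the conditional mutual information above shrinks and the gradient
% signal attributable to $v$ is diluted; an explicit objective that maximizes
% a bound on $I(v;\, y_{1:T})$ counteracts this degradation.
\end{document}
```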
Problem

Research questions and friction points this paper is trying to address.

Addressing modality bias in MLLMs favoring text over vision
Overcoming cross-modal alignment degradation with long text sequences
Enhancing visual understanding without extra modules or data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Maximizes cross-modal mutual information
Enhances vision-text alignment without extra modules
Improves visual understanding in MLLMs efficiently