🤖 AI Summary
This work identifies an intrinsic text bias in multimodal large language models (MLLMs): models over-rely on textual inputs and suppress visual evidence during inference. Method: We attribute this bias to architectural misalignment, specifically a distributional mismatch between vision-derived key vectors and the pretrained text key space that causes systematic attenuation of visual information in attention computation. Using t-SNE visualization and Jensen–Shannon divergence quantification, we demonstrate significant subspace separation between vision and text keys in the attention space of LLaVA and Qwen2.5-VL, with cross-modal divergence substantially exceeding intra-modal variation. Contribution/Results: Our analysis departs from the conventional external "data imbalance" explanation and provides the first direct, representation-level evidence that multimodal fusion bottlenecks are rooted in internal model architecture, advancing fundamental understanding of MLLM limitations and guiding the principled design of balanced multimodal attention.
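The attenuation mechanism described above can be illustrated with a toy sketch. This uses purely synthetic vectors (not keys extracted from any real model): text keys are loosely aligned with the query direction, as one would expect after language-only co-training, while "visual" keys are drawn from an unrelated distribution, standing in for the hypothesized OOD visual keys. Under scaled dot-product attention, the OOD keys then receive far less attention mass.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                 # head dimension (illustrative choice)
n_text, n_vis = 20, 20

# Hypothetical setup: the query and text keys were "co-trained", so text
# keys carry a component along the query direction; visual keys do not.
query = rng.normal(0.0, 1.0, d)
text_keys = 0.5 * query + rng.normal(0.0, 1.0, (n_text, d))
vis_keys = rng.normal(0.0, 1.0, (n_vis, d))   # OOD w.r.t. the query

keys = np.vstack([text_keys, vis_keys])
logits = keys @ query / np.sqrt(d)            # scaled dot-product scores
weights = np.exp(logits - logits.max())       # numerically stable softmax
weights /= weights.sum()

text_mass = weights[:n_text].sum()            # attention mass on text keys
vis_mass = weights[n_text:].sum()             # attention mass on visual keys
print(f"text attention mass:   {text_mass:.3f}")
print(f"visual attention mass: {vis_mass:.3f}")
```

With this setup, text tokens absorb nearly all of the softmax mass even though they make up only half of the context, mirroring the under-utilization of visual keys that the paper attributes to the distributional mismatch.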
📝 Abstract
Multimodal large language models (MLLMs) exhibit a pronounced preference for textual inputs when processing vision-language data, limiting their ability to reason effectively from visual evidence. Unlike prior studies that attribute this text bias to external factors such as data imbalance or instruction tuning, we propose that the bias originates from the model's internal architecture. Specifically, we hypothesize that visual key vectors (Visual Keys) are out-of-distribution (OOD) relative to the text key space learned during language-only pretraining. Consequently, these visual keys receive systematically lower similarity scores during attention computation, leading to their under-utilization in the context representation. To validate this hypothesis, we extract key vectors from LLaVA and Qwen2.5-VL and analyze their distributional structures using qualitative (t-SNE) and quantitative (Jensen-Shannon divergence) methods. The results provide direct evidence that visual and textual keys occupy markedly distinct subspaces within the attention space. The inter-modal divergence is statistically significant, exceeding intra-modal variation by several orders of magnitude. These findings reveal that text bias arises from an intrinsic misalignment within the attention key space rather than solely from external data factors.
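The quantitative comparison in the abstract (cross-modal vs. intra-modal divergence) can be sketched roughly as follows. This is a simple per-dimension histogram estimator over synthetic stand-ins for the extracted keys; the paper's exact estimation procedure, sample sizes, and the keys from LLaVA / Qwen2.5-VL are not reproduced here, and the t-SNE visualization step is omitted.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def per_dim_jsd(a, b, bins=30):
    """Mean Jensen-Shannon divergence across feature dimensions,
    estimated from per-dimension histograms (one simple estimator)."""
    jsds = []
    for i in range(a.shape[1]):
        lo = min(a[:, i].min(), b[:, i].min())
        hi = max(a[:, i].max(), b[:, i].max())
        pa, _ = np.histogram(a[:, i], bins=bins, range=(lo, hi), density=True)
        pb, _ = np.histogram(b[:, i], bins=bins, range=(lo, hi), density=True)
        # jensenshannon returns the JS distance; square it for the divergence
        jsds.append(jensenshannon(pa, pb) ** 2)
    return float(np.mean(jsds))

rng = np.random.default_rng(0)
text_keys = rng.normal(0.0, 1.0, (500, 64))   # stand-in for extracted text keys
vis_keys = rng.normal(1.5, 1.0, (500, 64))    # shifted subspace, as hypothesized

half = len(text_keys) // 2
intra = per_dim_jsd(text_keys[:half], text_keys[half:])  # intra-modal baseline
cross = per_dim_jsd(text_keys, vis_keys)                 # cross-modal gap
print(f"intra-modal JSD: {intra:.4f}")
print(f"cross-modal JSD: {cross:.4f}")
```

The intra-modal baseline (two halves of the same text-key sample) reflects only estimation noise, while the cross-modal value is driven by the subspace shift, which is the qualitative pattern the abstract reports for real MLLM keys.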