The Cost of Language: Centroid Erasure Exposes and Exploits Modal Competition in Multimodal Language Models

📅 2026-04-15

🤖 AI Summary
This work addresses language-modality dominance in multimodal language models, which degrades performance on vision-centric tasks. The authors introduce centroid replacement, which collapses each token to its nearest K-means centroid, as a controlled probe of modal dependence. Across seven models from three architecture families, erasing text centroid structure costs about 4× more accuracy than erasing visual centroid structure. They exploit this asymmetry with a training-free, inference-time method, text centroid contrastive decoding, which decodes contrastively against a text-centroid-erased reference. The method recovers up to 16.9% accuracy on individual tasks, with average gains of 5.6% for standard fine-tuned models and 1.5% for preference-optimized models.

📝 Abstract
Multimodal language models systematically underperform on visual perception tasks, yet the structure underlying this failure remains poorly understood. We propose centroid replacement, collapsing each token to its nearest K-means centroid, as a controlled probe for modal dependence. Across seven models spanning three architecture families, erasing text centroid structure costs 4× more accuracy than erasing visual centroid structure, exposing a universal imbalance where language representations overshadow vision even on tasks that demand visual reasoning. We exploit this asymmetry through text centroid contrastive decoding, recovering up to +16.9% accuracy on individual tasks by contrastively decoding against a text-centroid-erased reference. This intervention varies meaningfully with training approaches: standard fine-tuned models show larger gains (+5.6% on average) than preference-optimized models (+1.5% on average). Our findings suggest that modal competition is structurally localized, correctable at inference time without retraining, and quantifiable as a diagnostic signal to guide future multimodal training.
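
The two operations named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the plain Lloyd-style K-means, the toy 2-D "token embeddings", and the contrastive form (1 + alpha) * full - alpha * erased are all assumptions chosen for illustration. The abstract does not specify the contrastive formula, the weight alpha, the number of clusters, or the layer at which centroids are computed.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda j: sq_dist(point, centroids[j]))

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd-style K-means (hypothetical helper, stdlib only)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def centroid_replace(tokens, centroids):
    """Collapse each token vector to its nearest centroid."""
    return [list(centroids[nearest(t, centroids)]) for t in tokens]

def contrastive_logits(full, erased, alpha=1.0):
    """Assumed contrastive form: amplify what the intact model predicts
    relative to the reference run on centroid-erased text tokens."""
    return [(1 + alpha) * f - alpha * e for f, e in zip(full, erased)]

# Toy usage: four 2-D "text token" vectors collapsed onto two centroids.
text_tokens = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
text_centroids = kmeans(text_tokens, k=2)
erased_tokens = centroid_replace(text_tokens, text_centroids)
```

In an actual pipeline, `erased_tokens` would presumably be fed back through the model to produce the reference logits passed to `contrastive_logits`; that forward pass is model-specific and is left out here.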
Problem

Research questions and friction points this paper is trying to address.

multimodal language models
visual perception
modal competition
language dominance
centroid structure
Innovation

Methods, ideas, or system contributions that make the work stand out.

centroid erasure
modal competition
contrastive decoding
multimodal language models
modality imbalance