The Cost of Language: Centroid Erasure Exposes and Exploits Modal Competition in Multimodal Language Models

📅 2026-04-15

🤖 AI Summary

Multimodal language models tend to over-rely on the language modality, which degrades their performance on vision-centric tasks. The authors introduce centroid replacement, which collapses each token to its nearest K-means centroid, as a probe for modal dependence: across seven models from three architecture families, erasing text centroid structure costs about 4× more accuracy than erasing visual centroid structure. They then exploit this asymmetry with a training-free, inference-time method, text centroid contrastive decoding, which decodes contrastively against a text-centroid-erased reference. The method recovers up to 16.9% accuracy on individual tasks, with average gains of 5.6% for standard fine-tuned models and 1.5% for preference-optimized models, suggesting that modal imbalance can be corrected without retraining.

📝 Abstract

Multimodal language models systematically underperform on visual perception tasks, yet the structure underlying this failure remains poorly understood. We propose centroid replacement, collapsing each token to its nearest K-means centroid, as a controlled probe for modal dependence. Across seven models spanning three architecture families, erasing text centroid structure costs 4$\times$ more accuracy than erasing visual centroid structure, exposing a universal imbalance where language representations overshadow vision even on tasks that demand visual reasoning. We exploit this asymmetry through text centroid contrastive decoding, recovering up to +16.9% accuracy on individual tasks by contrastively decoding against a text-centroid-erased reference. This intervention varies meaningfully with training approaches: standard fine-tuned models show larger gains (+5.6% on average) than preference-optimized models (+1.5% on average). Our findings suggest that modal competition is structurally localized, correctable at inference time without retraining, and quantifiable as a diagnostic signal to guide future multimodal training.
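
The two operations named in the abstract can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration and is not taken from the paper: the function names (`kmeans`, `centroid_replace`, `contrastive_scores`), the plain Lloyd's K-means, the number of clusters, and the contrastive form `(1 + alpha) * log p_full - alpha * log p_erased`, which is borrowed from common contrastive-decoding practice. The paper's actual layer choice, clustering setup, and decoding rule may differ.

```python
import numpy as np


def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's K-means. Returns (centroids, assignments)."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct data points.
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared Euclidean distances, shape (n_tokens, k).
        dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = x[assign == j]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    # Final assignment against the final centroids.
    dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return centroids, dists.argmin(axis=1)


def centroid_replace(tokens, k):
    """Collapse each token vector to its nearest K-means centroid."""
    centroids, assign = kmeans(tokens, k)
    return centroids[assign]


def log_softmax(z):
    z = np.asarray(z, dtype=float)
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def contrastive_scores(full_logits, erased_logits, alpha=1.0):
    """Score tokens against an erased reference (assumed form, see lead-in)."""
    return (1 + alpha) * log_softmax(full_logits) - alpha * log_softmax(erased_logits)


# Hypothetical stand-in for text-token representations: 32 tokens, 8 dimensions.
tokens = np.random.default_rng(0).normal(size=(32, 8))
erased = centroid_replace(tokens, k=4)

# Toy next-token logits from the full input and from the erased reference.
full_logits = np.array([2.0, 1.0, 0.0])
erased_logits = np.array([3.0, 0.0, 0.0])
scores = contrastive_scores(full_logits, erased_logits, alpha=1.0)
```

In this toy example the erased reference prefers token 0 even more strongly than the full input does, so the contrastive score down-weights token 0 and token 1 becomes the top choice. In a real model, `erased_logits` would come from a second forward pass in which the text tokens have been replaced by their centroids.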

Problem

Research questions and friction points this paper is trying to address.

multimodal language models

visual perception

modal competition

language dominance

centroid structure

Innovation

Methods, ideas, or system contributions that make the work stand out.

centroid erasure

modal competition

contrastive decoding