🤖 AI Summary
This work addresses language-modality dominance in multimodal language models, which degrades performance on vision-centric tasks. The authors propose a training-free, inference-time correction: they extract text centroids via K-means clustering, use centroid replacement to build a text-centroid-erased reference, and contrastively decode against it. Across seven models spanning three architecture families, centroid replacement reveals a consistent modality imbalance: erasing text centroid structure costs 4$\times$ more accuracy than erasing visual centroid structure. The method recovers up to 16.9% accuracy on individual tasks, with average gains of 5.6% for standard fine-tuned models and 1.5% for preference-optimized models, suggesting that modality imbalance can be corrected without additional training.
📝 Abstract
Multimodal language models systematically underperform on visual perception tasks, yet the structure underlying this failure remains poorly understood. We propose centroid replacement, collapsing each token to its nearest K-means centroid, as a controlled probe for modal dependence. Across seven models spanning three architecture families, erasing text centroid structure costs 4$\times$ more accuracy than erasing visual centroid structure, exposing a universal imbalance where language representations overshadow vision even on tasks that demand visual reasoning. We exploit this asymmetry through text centroid contrastive decoding, which decodes contrastively against a text-centroid-erased reference and recovers up to +16.9% accuracy on individual tasks. The size of the gain depends on the training approach: standard fine-tuned models show larger gains (+5.6% on average) than preference-optimized models (+1.5% on average). Our findings suggest that modal competition is structurally localized, correctable at inference time without retraining, and quantifiable as a diagnostic signal to guide future multimodal training.
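
The two operations named in the abstract can be illustrated with a minimal, dependency-free sketch. It is a toy under stated assumptions, not the authors' implementation: the function names, the plain Lloyd's K-means, the Euclidean distance, and the contrast rule `(1 + alpha) * full - alpha * erased` with its `alpha` weight are all hypothetical choices, since the abstract does not specify which layer the centroids are computed on or how the contrast is formed.

```python
import random


def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centroids[j])),
    )


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns k centroids as lists of floats."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        for j, members in enumerate(groups):
            if members:  # keep the old centroid if a cluster empties out
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def centroid_replace(tokens, centroids):
    """Collapse every token embedding onto its nearest centroid."""
    return [list(centroids[nearest(t, centroids)]) for t in tokens]


def contrastive_logits(full, erased, alpha=1.0):
    """Assumed contrast rule (not from the paper): (1 + alpha) * full - alpha * erased."""
    return [(1 + alpha) * f - alpha * e for f, e in zip(full, erased)]


# Toy "text token" embeddings forming two well-separated groups.
text_tokens = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
centroids = kmeans(text_tokens, k=2)
erased_tokens = centroid_replace(text_tokens, centroids)

# Hypothetical next-token logits from the original input and from the
# text-centroid-erased input; in practice both would come from model forward passes.
adjusted = contrastive_logits(full=[2.0, 1.0], erased=[1.5, 1.0], alpha=1.0)
```

After replacement, tokens in the same cluster become identical, which is what removes the within-cluster structure that the probe measures. Under the assumed contrast rule, vocabulary entries whose logits drop when that structure is erased are boosted relative to those that do not change.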