🤖 AI Summary
In image–text matching, existing vision-language models (VLMs) over-rely on linguistic priors, leading to insufficient visual content modeling and misalignment in cross-modal grounding. To address this, we propose the Multimodal Association Score (MASS) framework, a plug-and-play, zero-shot, post-hoc language bias correction method that requires no fine-tuning or additional training overhead. MASS employs attention reweighting and contrastive semantic disentanglement to enhance visual fidelity while preserving compositional language understanding, and it is fully compatible with mainstream VLMs such as CLIP and ALPRO. Evaluated on Flickr30K and COCO, MASS reduces language bias by 37% (measured via standard bias metrics) and improves image–text retrieval accuracy by 2.1–3.8 percentage points. These gains reflect substantially improved visual grounding in cross-modal alignment, demonstrating that effective bias mitigation can be achieved without modifying model parameters or training pipelines.
📝 Abstract
Pretrained vision-language models have made significant advances in multimodal tasks, including image-text retrieval. However, a major challenge in image-text matching lies in language bias, where models rely predominantly on language priors and fail to adequately account for the visual content. We therefore present the Multimodal ASsociation Score (MASS), a framework that reduces reliance on language priors for better visual accuracy in image-text matching. It can be seamlessly incorporated into existing vision-language models without requiring additional training. Our experiments show that MASS effectively lessens language bias without sacrificing its understanding of linguistic compositionality. Overall, MASS offers a promising solution for enhancing image-text matching performance in vision-language models.
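The core idea of a post-hoc, training-free language-bias correction can be sketched in a few lines. The sketch below is illustrative only and is not the paper's exact formulation: it assumes the VLM exposes a cross-modal log-score `log p(t | v)` and that a text-only log-prior `log p(t)` (e.g. from a language model) approximates the language bias, then subtracts the prior, PMI-style, so captions that are merely "likely sentences" stop dominating the ranking. The function name `mass_score` and the parameter `lam` are hypothetical names introduced here for illustration.

```python
# Hedged sketch of a generic post-hoc language-bias correction for
# image-text matching (assumed PMI-style form; not the paper's exact
# method). No model parameters are modified and no training is needed:
# the correction is applied to scores at inference time.
import math

def mass_score(cross_modal_score: float,
               text_log_prior: float,
               lam: float = 1.0) -> float:
    """Return a debiased association score.

    cross_modal_score: log p(t | v) from the VLM (assumption).
    text_log_prior:    log p(t) from a text-only model (assumption).
    lam:               strength of the bias correction (lam=0 recovers
                       the raw VLM score).
    """
    return cross_modal_score - lam * text_log_prior

# Toy example: two candidate captions for one image. Caption B gets a
# higher raw score only because it is a more "typical" sentence; after
# subtracting the language prior, the better-grounded caption A wins.
raw_a, prior_a = math.log(0.30), math.log(0.10)  # specific caption
raw_b, prior_b = math.log(0.35), math.log(0.60)  # generic caption
scores = {"A": mass_score(raw_a, prior_a), "B": mass_score(raw_b, prior_b)}
best = max(scores, key=scores.get)  # "A" after correction
```

In practice the two log-scores would come from the VLM's matching head and any off-the-shelf language model; the single scalar `lam` is the only knob, which is what makes this style of correction plug-and-play.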