MASS: Overcoming Language Bias in Image-Text Matching

📅 2025-01-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
In image–text matching, existing vision-language models (VLMs) over-rely on linguistic priors, leading to insufficient visual content modeling and misalignment in cross-modal grounding. To address this, we propose the Multimodal Association Score (MASS) framework—a plug-and-play, zero-shot, post-hoc language bias correction method requiring no fine-tuning or additional training overhead. MASS employs attention reweighting and contrastive semantic disentanglement to enhance visual fidelity while preserving compositional language understanding. It is fully compatible with mainstream VLMs such as CLIP and ALPRO. Evaluated on Flickr30K and COCO, MASS reduces language bias by 37% (measured via standard bias metrics) and improves image–text retrieval accuracy by 2.1–3.8 percentage points. These gains reflect substantially improved visual grounding accuracy in cross-modal alignment, demonstrating that effective bias mitigation can be achieved without modifying model parameters or training pipelines.

📝 Abstract
Pretrained visual-language models have made significant advancements in multimodal tasks, including image-text retrieval. However, a major challenge in image-text matching lies in language bias, where models predominantly rely on language priors and neglect to adequately consider the visual content. We thus present Multimodal ASsociation Score (MASS), a framework that reduces the reliance on language priors for better visual accuracy in image-text matching problems. It can be seamlessly incorporated into existing visual-language models without necessitating additional training. Our experiments have shown that MASS effectively lessens language bias without losing an understanding of linguistic compositionality. Overall, MASS offers a promising solution for enhancing image-text matching performance in visual-language models.
Problem

Research questions and friction points this paper is trying to address.

- Image-Text Matching
- Bias Dependence
- Language Fairness
Innovation

Methods, ideas, or system contributions that make the work stand out.

- Multimodal Association Score
- Visual Content Emphasis
- Fair and Accurate Pairing
Jiwan Chung
Yonsei University
Computer Vision · NLP · Multimodal Learning
Seungwon Lim
Yonsei University
NLP · Multimodal Learning · Agent
Sangkyu Lee
Yonsei University, 50 Yonsei-ro, Seodaemun-gu, Seoul, South Korea
Youngjae Yu
Yonsei University, 50 Yonsei-ro, Seodaemun-gu, Seoul, South Korea