A Multimodal Recaptioning Framework to Account for Perceptual Diversity in Multilingual Vision-Language Modeling

📅 2025-04-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Cross-lingual vision-language models suffer from cultural perception bias due to English-dominant training data, limiting their ability to model the diversity of object descriptions in non-English contexts. To address this, we propose a data-efficient framework featuring: (1) an LLM-driven multimodal recaptioning strategy that rewrites the object descriptions of English image captions prior to translation, injecting culturally adaptive semantics; (2) a native-speaker-guided targeted annotation protocol ensuring both semantic fidelity and cultural plausibility; and (3) a fine-tuning paradigm for text–image retrieval that supports cross-lingual and cross-dataset generalization. Evaluated on German and Japanese image–text retrieval benchmarks, our approach improves mean recall by up to +3.5 overall and by +4.7 on non-native error cases. These results demonstrate substantially enhanced modeling of culturally grounded descriptive diversity and improved generalization across languages and datasets.

📝 Abstract
There are many ways to describe, name, and group objects when captioning an image. Differences are evident when speakers come from diverse cultures due to the unique experiences that shape perception. Machine translation of captions has pushed multilingual capabilities in vision-language models (VLMs), but the data comes mainly from English speakers, indicating a perceptual bias and lack of model flexibility. In this work, we address this challenge and outline a data-efficient framework to instill multilingual VLMs with greater understanding of perceptual diversity. We specifically propose an LLM-based, multimodal recaptioning strategy that alters the object descriptions of English captions before translation. The greatest benefits are demonstrated by a targeted multimodal mechanism guided by native speaker data. By adding the produced rewrites as augmentations in training, we improve on German and Japanese text-image retrieval case studies (up to +3.5 mean recall overall, +4.7 on non-native error cases). We further propose a mechanism to analyze the specific object description differences across datasets, and we offer insights into cross-dataset and cross-language generalization.
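The core augmentation idea in the abstract, rewriting object descriptions in English captions before translating them and adding the rewrites as extra training pairs, can be sketched as follows. This is a minimal illustration, not the paper's implementation: `llm_rewrite` and `translate` are stand-ins for an actual LLM prompt and machine-translation call, and the `{"scooter": "moped"}` substitution is a hypothetical example of a culturally motivated object-description swap.

```python
# Sketch of recaptioning-before-translation augmentation (hypothetical names).
# The paper's LLM prompting, MT system, and fine-tuning setup are not
# reproduced here; these functions are deterministic placeholders.

def llm_rewrite(caption: str, substitutions: dict) -> str:
    """Stand-in for an LLM call that rewrites object descriptions,
    e.g. swapping a term for a culturally plausible variant."""
    for src, dst in substitutions.items():
        caption = caption.replace(src, dst)
    return caption

def translate(caption: str, target_lang: str) -> str:
    """Stand-in for a machine-translation call (e.g. to German or Japanese)."""
    return f"[{target_lang}] {caption}"  # placeholder: real MT goes here

def augment(captions, substitutions, target_lang):
    """For each caption, keep the translated original and add a translated
    rewrite, yielding extra text-image training pairs for retrieval."""
    pairs = []
    for cap in captions:
        pairs.append(translate(cap, target_lang))  # original caption
        pairs.append(translate(llm_rewrite(cap, substitutions), target_lang))  # rewrite
    return pairs

pairs = augment(
    ["a man rides a scooter down the street"],
    {"scooter": "moped"},  # illustrative object-description difference
    "de",
)
```

In the paper's targeted variant, the choice of which object descriptions to rewrite is guided by native speaker data rather than a fixed substitution table as in this toy sketch.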
Problem

Research questions and friction points this paper is trying to address.

Addresses perceptual bias in multilingual vision-language models
Enhances model flexibility for diverse cultural descriptions
Improves multilingual text-image retrieval with recaptioning strategy
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based multimodal recaptioning strategy
Native speaker-guided targeted mechanism
Cross-dataset object description analysis