🤖 AI Summary
This survey identifies a fundamental tension in multilingual vision-language models (MLVLMs) between language neutrality, i.e., consistent representations across languages, and cultural awareness, i.e., adaptation to cultural context. We review 31 MLVLMs and 21 benchmarks, spanning encoder-only and generative architectures, through this language neutrality–cultural awareness lens and uncover a structural misalignment: roughly two-thirds (~67%) of existing benchmarks rely on translation-based evaluation that prioritizes semantic consistency across languages while neglecting culturally grounded understanding. On the training side, prevailing objectives such as contrastive learning favor neutrality, whereas cultural awareness depends largely on the diversity of the training data. The reviewed evidence shows that current MLVLMs struggle to balance cross-lingual generalization with culturally appropriate reasoning. Our contributions are: (1) an analytical framing that clarifies the neutrality–awareness trade-off; (2) a structured review of translation-aligned and emerging culture-sensitive evaluation protocols; and (3) empirical observations on the gap between training objectives and evaluation goals, which can guide the development of more culturally aware MLVLMs.
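The neutrality side of the trade-off is usually driven by CLIP-style contrastive alignment. The sketch below is an illustration of that mechanism, not the survey's own method: a symmetric InfoNCE loss in PyTorch (the function name, batch size, and embedding dimension are assumptions made for the example). Because captions in any language are pulled toward the same image anchor, minimizing this loss pushes translated captions toward a shared, language-neutral representation.

```python
import torch
import torch.nn.functional as F

def multilingual_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over an in-batch image-text similarity matrix.

    image_emb: (N, D) image embeddings.
    text_emb:  (N, D) embeddings of the paired captions, which may be written
               in (or translated into) different languages.
    The matched pair (i, i) is the positive; every other pair in the batch is
    a negative. Captions in different languages that describe the same image
    share one anchor, which is what drives language-neutral representations.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature             # (N, N)
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)                 # image -> text
    loss_t2i = F.cross_entropy(logits.t(), targets)             # text -> image
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random embeddings (batch of 8, dimension 512).
loss = multilingual_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```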
📝 Abstract
This survey examines multilingual vision-language models (MLVLMs) that process text and images across languages. We review 31 models and 21 benchmarks, spanning encoder-only and generative architectures, and identify a key tension between language neutrality (consistent cross-lingual representations) and cultural awareness (adaptation to cultural contexts). Current training methods favor neutrality through contrastive learning, while cultural awareness depends on diverse training data. Two-thirds of evaluation benchmarks use translation-based approaches that prioritize semantic consistency, though recent work incorporates culturally grounded content. We find discrepancies in models' cross-lingual capabilities and gaps between training objectives and evaluation goals.
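To make the translation-based evaluation idea concrete, here is a minimal, hypothetical consistency check (not a metric defined in the survey): given the same items posed in several languages, it reports the fraction of items answered identically across languages. Culturally grounded benchmarks deliberately relax this assumption, since the appropriate answer may legitimately differ by language and region.

```python
def cross_lingual_consistency(answers_by_lang: dict[str, list[str]]) -> float:
    """Fraction of items answered identically in every language.

    answers_by_lang maps a language code (e.g. "en", "hi") to a list of model
    answers, aligned so that index i is the same underlying item (typically a
    translated prompt) in each language.
    """
    langs = list(answers_by_lang)
    n_items = len(answers_by_lang[langs[0]])
    consistent = sum(
        len({answers_by_lang[lang][i] for lang in langs}) == 1
        for i in range(n_items)
    )
    return consistent / n_items


# Example: two of three items receive the same answer in all three languages.
print(cross_lingual_consistency({
    "en": ["cat", "red", "bread"],
    "de": ["cat", "red", "bread"],
    "hi": ["cat", "blue", "bread"],
}))  # ~0.67
```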