Multilingual Vision-Language Models, A Survey

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This survey identifies a fundamental tension in multilingual vision-language models (MLVLMs) between language neutrality (cross-lingual representational consistency) and cultural awareness (adaptation to cultural context). It systematically reviews 31 MLVLMs and 21 benchmarks through a "language neutrality vs. cultural awareness" analytical lens, uncovering a structural misalignment: roughly two-thirds of existing benchmarks prioritize translation-equivalent semantic consistency while neglecting culturally grounded understanding. The survey further observes that prevailing training recipes favor neutrality via contrastive learning, whereas cultural awareness depends on data diversity, and that current MLVLMs struggle to balance cross-lingual generalization with culturally appropriate reasoning. Its contributions include: (1) a framework clarifying the neutrality–awareness trade-off; (2) a structured review of translation-aligned and emerging culture-sensitive evaluation protocols; and (3) empirical observations to guide the development of more culturally aware MLVLMs.

📝 Abstract
This survey examines multilingual vision-language models that process text and images across languages. We review 31 models and 21 benchmarks, spanning encoder-only and generative architectures, and identify a key tension between language neutrality (consistent cross-lingual representations) and cultural awareness (adaptation to cultural contexts). Current training methods favor neutrality through contrastive learning, while cultural awareness depends on diverse data. Two-thirds of evaluation benchmarks use translation-based approaches prioritizing semantic consistency, though recent work incorporates culturally grounded content. We find discrepancies in cross-lingual capabilities and gaps between training objectives and evaluation goals.
Problem

Research questions and friction points this paper is trying to address.

Examines how multilingual vision-language models process text and images across languages
Identifies a tension between language neutrality and cultural awareness
Finds discrepancies in cross-lingual capabilities and gaps between training objectives and evaluation goals
Innovation

Methods, ideas, or system contributions that make the work stand out.

Surveys both encoder-only and generative multilingual vision-language architectures
Shows that training favors language neutrality through contrastive learning, while cultural awareness depends on data diversity
Reviews evaluation benchmarks that add culturally grounded content alongside translation-based tests
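The survey observes that current training favors language neutrality via contrastive learning. As an illustration, here is a minimal NumPy sketch of a CLIP-style symmetric contrastive (InfoNCE) objective over paired image and caption embeddings; the function name and shapes are hypothetical, not taken from any model discussed in the survey.

```python
import numpy as np

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of matched image/text embeddings.

    img_emb, txt_emb: arrays of shape (batch, dim); row i of each is a matched pair.
    """
    # L2-normalize so the dot product is cosine similarity
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (batch, batch) similarity matrix
    n = len(img)

    def cross_entropy(l):
        # softmax cross-entropy with the true pairs on the diagonal
        l = l - l.max(axis=1, keepdims=True)            # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # average of image->text and text->image directions
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2
```

Pulling all captions of an image toward one shared embedding, regardless of language, is what makes this objective push models toward language neutrality rather than culture-specific representations.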
Andrei-Alexandru Manea
Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic
Jindřich Libovický
Charles University
natural language processing, multilinguality, neural machine translation, language and vision