🤖 AI Summary
This survey identifies a fundamental tension in multilingual vision-language models (MLVLMs) between language neutrality, i.e., consistent representations across languages, and cultural awareness, i.e., adaptation to cultural context. We review 31 MLVLMs and 21 benchmarks, spanning encoder-only and generative architectures, through this language neutrality–cultural awareness lens and uncover a structural misalignment: roughly two-thirds (~67%) of existing benchmarks rely on translation-based evaluation that prioritizes semantic consistency across languages while neglecting culturally grounded understanding. On the training side, prevailing objectives such as contrastive learning favor neutrality, whereas cultural awareness depends largely on the diversity of the training data. The reviewed evidence shows that current MLVLMs struggle to balance cross-lingual generalization with culturally appropriate reasoning. Our contributions are: (1) an analytical framing that clarifies the neutrality–awareness trade-off; (2) a structured review of translation-aligned and emerging culture-sensitive evaluation protocols; and (3) empirical observations on the gap between training objectives and evaluation goals, which can guide the development of more culturally aware MLVLMs.
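The neutrality side of the trade-off is usually driven by CLIP-style contrastive alignment. The sketch below is an illustration of that mechanism, not the survey's own method: a symmetric InfoNCE loss in PyTorch (the function name, batch size, and embedding dimension are assumptions made for the example). Because captions in any language are pulled toward the same image anchor, minimizing this loss pushes translated captions toward a shared, language-neutral representation.

```python
import torch
import torch.nn.functional as F

def multilingual_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over an in-batch image-text similarity matrix.

    image_emb: (N, D) image embeddings.
    text_emb:  (N, D) embeddings of the paired captions, which may be written
               in (or translated into) different languages.
    The matched pair (i, i) is the positive; every other pair in the batch is
    a negative. Captions in different languages that describe the same image
    share one anchor, which is what drives language-neutral representations.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature             # (N, N)
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)                 # image -> text
    loss_t2i = F.cross_entropy(logits.t(), targets)             # text -> image
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random embeddings (batch of 8, dimension 512).
loss = multilingual_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```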
📝 Abstract
This survey examines multilingual vision-language models (MLVLMs) that process text and images across languages. We review 31 models and 21 benchmarks, spanning encoder-only and generative architectures, and identify a key tension between language neutrality (consistent cross-lingual representations) and cultural awareness (adaptation to cultural contexts). Current training methods favor neutrality through contrastive learning, while cultural awareness depends on diverse training data. Two-thirds of evaluation benchmarks use translation-based approaches that prioritize semantic consistency, though recent work incorporates culturally grounded content. We find discrepancies in models' cross-lingual capabilities and gaps between training objectives and evaluation goals.
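To make the translation-based evaluation idea concrete, here is a minimal, hypothetical consistency check (not a metric defined in the survey): given the same items posed in several languages, it reports the fraction of items answered identically across languages. Culturally grounded benchmarks deliberately relax this assumption, since the appropriate answer may legitimately differ by language and region.

```python
def cross_lingual_consistency(answers_by_lang: dict[str, list[str]]) -> float:
    """Fraction of items answered identically in every language.

    answers_by_lang maps a language code (e.g. "en", "hi") to a list of model
    answers, aligned so that index i is the same underlying item (typically a
    translated prompt) in each language.
    """
    langs = list(answers_by_lang)
    n_items = len(answers_by_lang[langs[0]])
    consistent = sum(
        len({answers_by_lang[lang][i] for lang in langs}) == 1
        for i in range(n_items)
    )
    return consistent / n_items


# Example: two of three items receive the same answer in all three languages.
print(cross_lingual_consistency({
    "en": ["cat", "red", "bread"],
    "de": ["cat", "red", "bread"],
    "hi": ["cat", "blue", "bread"],
}))  # ~0.67
```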