All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages

📅 2024-11-25
🏛️ arXiv.org
📈 Citations: 8 (influential: 0)
🤖 AI Summary
Contemporary large multimodal models (LMMs) suffer from narrow cultural coverage, weak support for low-resource languages, and insufficient cross-cultural visual–linguistic reasoning capabilities. To address these limitations, we introduce ALM-bench—the first multimodal evaluation benchmark covering 100 languages (including numerous low-resource ones) and 13 cultural dimensions. Methodologically, we propose a hierarchical question-type design (true/false, multiple-choice, open-ended QA), integrate human-annotated multilingual image–text pairs, employ cultural knowledge graphs to guide content sampling, and establish a standardized evaluation protocol enabling multi-granularity assessment. Comprehensive experiments on leading open- and closed-source LMMs systematically expose their significant performance deficits on low-resource language understanding and culture-specific reasoning tasks—revealing these shortcomings for the first time at scale. This work advances the development of globally accessible, culturally inclusive LMMs and provides both a novel paradigm and foundational infrastructure for cross-cultural multimodal understanding research.

📝 Abstract
Existing Large Multimodal Models (LMMs) generally focus on only a few regions and languages. As LMMs continue to improve, it is increasingly important to ensure they understand cultural contexts, respect local sensitivities, and support low-resource languages, all while effectively integrating corresponding visual cues. In pursuit of culturally diverse global multimodal models, our proposed All Languages Matter Benchmark (ALM-bench) represents the largest and most comprehensive effort to date for evaluating LMMs across 100 languages. ALM-bench challenges existing models by testing their ability to understand and reason about culturally diverse images paired with text in various languages, including many low-resource languages traditionally underrepresented in LMM research. The benchmark offers a robust and nuanced evaluation framework featuring various question formats, including true/false, multiple choice, and open-ended questions, which are further divided into short and long-answer categories. ALM-bench design ensures a comprehensive assessment of a model's ability to handle varied levels of difficulty in visual and linguistic reasoning. To capture the rich tapestry of global cultures, ALM-bench carefully curates content from 13 distinct cultural aspects, ranging from traditions and rituals to famous personalities and celebrations. Through this, ALM-bench not only provides a rigorous testing ground for state-of-the-art open and closed-source LMMs but also highlights the importance of cultural and linguistic inclusivity, encouraging the development of models that can serve diverse global populations effectively. Our benchmark is publicly available.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LMMs on 100 culturally diverse languages
Assessing LMMs' understanding of cultural contexts and low-resource languages
Testing LMMs' visual and linguistic reasoning across varied cultural aspects
Innovation

Methods, ideas, or system contributions that make the work stand out.

First LMM benchmark spanning 100 languages and 13 cultural aspects
Pairs culturally diverse images with text, including many low-resource languages
Uses varied question formats (true/false, multiple choice, short/long open-ended) for nuanced evaluation
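The evaluation setup described above (multiple question formats, per-language scoring) can be sketched as a minimal loop. The item schema, sample data, and scoring rules below are illustrative assumptions for demonstration only, not the benchmark's actual code or data:

```python
# Illustrative sketch of an ALM-bench-style evaluation loop.
# The BenchItem schema, sample items, and scoring rules are assumptions;
# the real benchmark defines its own formats and judging protocol.
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class BenchItem:
    language: str   # e.g. "Swahili", one of the 100 languages
    aspect: str     # one of the 13 cultural aspects
    qtype: str      # "true/false" | "mcq" | "open-short" | "open-long"
    question: str
    reference: str  # ground-truth answer

def score(item: BenchItem, prediction: str) -> float:
    """Exact match for closed formats; open-ended answers would need
    human or LLM judging, stubbed here as a crude token-overlap proxy."""
    pred = prediction.strip().lower()
    ref = item.reference.strip().lower()
    if item.qtype in ("true/false", "mcq"):
        return float(pred == ref)
    ref_tokens = set(ref.split())
    return len(ref_tokens & set(pred.split())) / max(len(ref_tokens), 1)

def per_language_accuracy(items, predictions):
    """Aggregate mean score per language, as a multilingual benchmark would."""
    totals, counts = defaultdict(float), defaultdict(int)
    for item, pred in zip(items, predictions):
        totals[item.language] += score(item, pred)
        counts[item.language] += 1
    return {lang: totals[lang] / counts[lang] for lang in totals}

# Hypothetical sample items and model predictions.
items = [
    BenchItem("Swahili", "Food", "mcq", "Which dish is shown?", "B"),
    BenchItem("Swahili", "Festivals", "true/false", "Is this Eid?", "true"),
    BenchItem("Urdu", "Heritage", "open-short", "Name the monument.", "badshahi mosque"),
]
preds = ["B", "false", "Badshahi Mosque"]
print(per_language_accuracy(items, preds))
```

Grouping scores by language (rather than reporting one global number) is what exposes the low-resource-language gaps the paper highlights.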
👥 Authors

Ashmal Vayani · University of Central Florida · Computer Vision, MultiModality, Large Language Models, Responsible AI
Dinura Dissanayake · Research Engineer, MBZUAI · Computer Vision, Reasoning
Hasindri Watawana · Mohamed bin Zayed University of AI
Noor Ahsan · Research Engineer · Remote Sensing, Geospatial Data, Machine Learning Research
Nevasini Sasikumar · Mohamed bin Zayed University of AI
Omkar Thawakar · MBZUAI, UAE · Computer Vision, Machine Learning, Generative AI, LLM, Foundation Models
Henok Biadglign Ademtew · Researcher · Deep Learning, Multimodal, NLP
Yahya Hmaiti
Amandeep Kumar · Ph.D. Student, The Johns Hopkins University · Deep Learning, Computer Vision, Pattern Recognition
Kartik Kuckreja
Mykola Maslych · CS PhD candidate at ISUE Lab, University of Central Florida · Human-Computer Interaction, 3DUI, Gestural Interfaces, Virtual Reality, Applied Machine Learning
Wafa Al Ghallabi · PhD student · Computer Vision, VLM
M. Mihaylov
Chao Qin
Abdelrahman M. Shaker
Mike Zhang · Aalborg University (Copenhagen) · Artificial Intelligence, Natural Language Processing, Information Extraction, NLP Applications
Mahardika Krisna Ihsani · MBZUAI · Natural Language Processing, Machine Learning, Interpretability, Computational Linguistics
Amiel Esplana
Monil Gokani
Shachar Mirkin
Harsh Singh
Ashay Srivastava
Endre Hamerlik · PhD Student @ Comenius University in Bratislava · neural networks, nlp, green energy
Fathinah Asma Izzati
F. Maani
Sebastian Cavada
Jenny Chim · Queen Mary University of London · natural language processing, computational linguistics
Rohit Gupta
Sanjay Manjunath
Kamila Zhumakhanova
F. H. Rabevohitra
A. Amirudin
Muhammad Ridzuan · Mohamed bin Zayed University of Artificial Intelligence · AI for Healthcare, Machine Learning, Deep Learning, Computer Vision, Geology
D. Kareem
Ketan More · MBZUAI · Computer Vision
Kunyang Li
Pramesh Shakya
Muhammad Saad · X (formerly Twitter) · Cybersecurity, Systems Security, Fraud Detection
Amirpouya Ghasemaghaei · CS PhD Student, University of Central Florida · Virtual Reality, 3D User Interfaces, Generative AI, Large Language Models
Amirbek Djanibekov · PhD Student, MBZUAI · Natural Language Processing, Speech Processing
Dilshod Azizov · MBZUAI · Machine Learning, NLP, Computer Vision
Branislava Jankovic · PhD at MBZUAI · AI, Computer Vision, Machine Learning, Computational Biology, IoT
Naman Bhatia
Alvaro Cabrera
Johan Obando-Ceron · Mila, University of Montreal · Deep Learning, Reinforcement Learning, Machine Learning, Artificial Intelligence
Olympiah Otieno
Fabian Farestam · ETH Zürich · games on graphs, llm evaluations
Muztoba Rabbani
Sanoojan Baliah · Research Associate · Visual Generation, Domain generalization, Computer vision, Machine learning
Santosh Sanjeev · Technology Innovation Institute · Multimodality, Vision Language Models, AI for healthcare, Generative AI
A. Shtanchaev
Maheen Fatima
Thao Nguyen
Amrin Kareem
Toluwani Aremu · MBZUAI · AI Safety, Trustworthy AI, Responsible AI
Nathan Xavier
Amit Bhatkal
H. Toyin
Aman Chadha · GenAI Leadership @ Apple • Stanford AI • UW-Madison ECE • Ex: Apple, AWS, Alexa, Nvidia · Multimodal AI, Natural Language Processing, Computer Vision, Speech Processing, Recommender Systems
Hisham Cholakkal · Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) · Computer Vision, Large Multimodal Models, LLM, Healthcare Foundation Model, Conversational Assistant
R. Anwer · Mohamed bin Zayed University of AI, Australian National University
Michael Felsberg · Professor of Computer Vision, Linköping University · Computer Vision, Machine Learning, Robot Vision
J. Laaksonen · Aalto University
T. Solorio · Mohamed bin Zayed University of AI
Monojit Choudhury · Professor of Natural Language Processing, MBZUAI · Natural Language Processing, Large Language Models, Ethics of AI, Computational Social Science
Ivan Laptev · Professor at MBZUAI, on leave from INRIA · Computer Vision, Robotics, Action Recognition, Object Recognition
Mubarak Shah · Trustee Chair Professor of Computer Science, University of Central Florida · Computer Vision
Salman Khan · Mohamed bin Zayed University of AI
F. Khan · Mohamed bin Zayed University of AI, Linköping University