Grounding Multilingual Multimodal LLMs With Cultural Knowledge

📅 2025-08-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal large language models (MLLMs) exhibit significant biases in understanding low-resource languages and long-tail cultural entities. To address this, we propose a culture-aware multilingual multimodal alignment framework. We construct CulturalGround, a high-quality, culturally grounded visual question answering (VQA) dataset of 22 million samples spanning 39 languages and 42 countries, built by retrieving culturally relevant images from Wikidata and generating synthetic multilingual VQA instances. On this dataset we train CulturalPangea, the first open-source multimodal model to offer fine-grained cross-cultural and cross-lingual visual grounding. Experiments show that CulturalPangea improves on existing open-source models by an average of +5.0% on culture-focused multilingual multimodal benchmarks while maintaining performance on mainstream vision-language tasks, substantially narrowing the cultural gap in multimodal foundation models.
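
The pipeline's first step, as summarized above, is retrieving culturally relevant entities and images from Wikidata. The sketch below is a minimal illustration of that step, assuming festivals (Q132241) as an example entity class and using the country (P17) and image (P18) properties; the paper's actual entity types, filters, and scale are not specified here.

```python
# Minimal sketch: query Wikidata's public SPARQL endpoint for culturally
# significant entities that carry a country and a Commons image.
# The festival class (Q132241) is an illustrative choice, not the paper's
# actual selection criterion.
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """
SELECT ?entity ?entityLabel ?countryLabel ?image WHERE {
  ?entity wdt:P31/wdt:P279* wd:Q132241 ;  # instance of (a subclass of) festival
          wdt:P17 ?country ;              # country associated with the entity
          wdt:P18 ?image .                # Wikimedia Commons image
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100
"""

def fetch_cultural_entities():
    resp = requests.get(
        SPARQL_ENDPOINT,
        params={"query": QUERY, "format": "json"},
        headers={"User-Agent": "cultural-vqa-sketch/0.1"},  # WDQS expects a UA
    )
    resp.raise_for_status()
    return [
        {
            "entity": row["entityLabel"]["value"],
            "country": row["countryLabel"]["value"],
            "image_url": row["image"]["value"],
        }
        for row in resp.json()["results"]["bindings"]
    ]
```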

📝 Abstract
Multimodal Large Language Models excel in high-resource settings but often misinterpret long-tail cultural entities and underperform in low-resource languages. To address this gap, we propose a data-centric approach that directly grounds MLLMs in cultural knowledge. Leveraging a large-scale knowledge graph from Wikidata, we collect images that represent culturally significant entities and generate synthetic multilingual visual question answering data. The resulting dataset, CulturalGround, comprises 22 million high-quality, culturally rich VQA pairs spanning 42 countries and 39 languages. We train an open-source MLLM, CulturalPangea, on CulturalGround, interleaving standard multilingual instruction-tuning data to preserve general abilities. CulturalPangea achieves state-of-the-art performance among open models on various culture-focused multilingual multimodal benchmarks, outperforming prior models by an average of 5.0% without degrading results on mainstream vision-language tasks. Our findings show that this targeted, culturally grounded approach can substantially narrow the cultural gap in MLLMs and offers a practical path towards globally inclusive multimodal systems.
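
To make the data shape concrete, the sketch below turns one retrieved entity record into multilingual VQA pairs. The paper generates these instances synthetically; the fixed question templates and the three languages here are assumptions used only to illustrate the output format, not the paper's generation method.

```python
# Minimal sketch of the shape of a CulturalGround-style VQA instance.
# Template-based generation is a simplified stand-in for the paper's
# synthetic generation; the templates and language set are assumptions.
QUESTION_TEMPLATES = {
    "en": "Which country is the {label} shown in this image associated with?",
    "sw": "Je, {label} inayoonekana katika picha hii inahusiana na nchi gani?",
    "hi": "इस चित्र में दिखाया गया {label} किस देश से जुड़ा है?",
}

def make_vqa_pairs(record):
    """record: dict with 'entity', 'country', and 'image_url' keys,
    e.g. one item returned by fetch_cultural_entities() above."""
    return [
        {
            "image_url": record["image_url"],
            "language": lang,
            "question": template.format(label=record["entity"]),
            "answer": record["country"],
        }
        for lang, template in QUESTION_TEMPLATES.items()
    ]
```
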
Problem

Research questions and friction points this paper is trying to address.

Address misinterpretation of cultural entities in MLLMs
Improve performance in low-resource language settings
Bridge cultural knowledge gap in multimodal systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages the Wikidata knowledge graph to source culturally significant entities and images
Generates synthetic multilingual visual question answering data at scale
Trains the MLLM on CulturalGround interleaved with standard multilingual instruction-tuning data to preserve general abilities (see the sketch below)
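
A minimal sketch of the interleaving idea referenced above, assuming a simple stochastic mix of the cultural VQA stream with a general instruction-tuning stream; the actual mixing ratio and schedule used for CulturalPangea are not given here, so cultural_fraction is a placeholder.

```python
# Minimal sketch: interleave culturally grounded VQA samples with standard
# multilingual instruction-tuning data. Mixing the streams (rather than
# fine-tuning on cultural data alone) is what lets the model gain cultural
# knowledge without forgetting general abilities. cultural_fraction is an
# assumed placeholder, not a value from the paper.
import random

def interleave(cultural_vqa, instruction_data, cultural_fraction=0.5, seed=0):
    """Yield a single shuffled training stream drawn from both sources."""
    rng = random.Random(seed)
    cultural, general = list(cultural_vqa), list(instruction_data)
    rng.shuffle(cultural)
    rng.shuffle(general)
    while cultural or general:
        take_cultural = bool(cultural) and (not general or rng.random() < cultural_fraction)
        yield (cultural if take_cultural else general).pop()
```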