🤖 AI Summary
Existing vision-language models (VLMs) generalize poorly in low-resource language and multicultural settings. To address this, we introduce Maya, an open-source multilingual VLM designed to enhance fine-grained cross-lingual and cross-cultural understanding. Methodologically: (1) we construct a multilingual image-text pretraining dataset covering eight languages; (2) we extend the LLaVA framework with multilingual instruction tuning, cross-lingual contrastive learning, and culture-aware image annotation; and (3) we design a lightweight unified architecture enabling joint cross-lingual and cross-cultural alignment. Experiments demonstrate that Maya significantly outperforms monolingual baselines on multilingual visual question answering, image captioning, and cross-modal retrieval tasks. Notably, it achieves over 32% absolute accuracy gains on low-resource languages, including Hindi and Arabic, highlighting its robustness in linguistically and culturally diverse scenarios.
📝 Abstract
Recent years have seen rapid development of large Vision-Language Models (VLMs). These models have shown impressive results on academic benchmarks, primarily in widely spoken languages, but they perform poorly on low-resource languages and in varied cultural contexts. To address these limitations, we introduce Maya, an open-source multilingual VLM. Our contributions are: 1) a multilingual image-text pretraining dataset in eight languages, based on the LLaVA pretraining dataset; and 2) a multilingual image-text model supporting these languages, enhancing cultural and linguistic comprehension in vision-language tasks. Code is available at https://github.com/nahidalam/maya.