Behind Maya: Building a Multilingual Vision Language Model

📅 2025-05-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language models (VLMs) generalize poorly to low-resource languages and multicultural settings. To address this, we introduce Maya, an open-source multilingual VLM designed for fine-grained cross-lingual and cross-cultural understanding. Methodologically, (1) we construct the first multilingual image-text pretraining dataset covering eight languages; (2) we extend the LLaVA framework with multilingual instruction tuning, cross-lingual contrastive learning, and culture-aware image annotation; and (3) we design a lightweight unified architecture enabling joint cross-lingual and cross-cultural alignment. Experiments demonstrate that Maya significantly outperforms monolingual baselines on multilingual visual question answering, image captioning, and cross-modal retrieval. Notably, it achieves over 32% absolute accuracy gains on low-resource languages, including Hindi and Arabic, highlighting its robustness in linguistically and culturally diverse scenarios.
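The cross-lingual contrastive learning mentioned in the summary is not detailed on this page. A minimal sketch of an InfoNCE-style image-text contrastive loss, the standard formulation such objectives are usually built on (all names and the temperature value here are illustrative assumptions, not the paper's actual implementation), might look like:

```python
import math

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """InfoNCE-style symmetric contrastive loss (illustrative sketch).

    Matched image/text pairs share an index and are pulled together;
    all other pairings in the batch act as negatives and are pushed
    apart. Embeddings are assumed to be L2-normalized lists of floats.
    """
    n = len(image_embs)
    # Cosine-similarity matrix (dot products of unit vectors), scaled
    # by the temperature so the softmax is sharper.
    sims = [[sum(a * b for a, b in zip(img, txt)) / temperature
             for txt in text_embs] for img in image_embs]
    loss = 0.0
    for i in range(n):
        # image -> text direction: row i, positive at column i
        row = sims[i]
        loss += -row[i] + math.log(sum(math.exp(s) for s in row))
        # text -> image direction: column i, positive at row i
        col = [sims[j][i] for j in range(n)]
        loss += -col[i] + math.log(sum(math.exp(s) for s in col))
    # average over both directions and all batch items
    return loss / (2 * n)
```

With perfectly aligned pairs the loss approaches zero, while mismatched pairs drive it up sharply; in a cross-lingual setting the text embeddings would come from captions in different languages for the same image.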

📝 Abstract
In recent years, large Vision-Language Models (VLMs) have developed rapidly. They show impressive results on academic benchmarks, primarily in widely spoken languages, but underperform on low-resource languages and in varied cultural contexts. To address these limitations, we introduce Maya, an open-source multilingual VLM. Our contributions are: 1) a multilingual image-text pretraining dataset in eight languages, based on the LLaVA pretraining dataset; and 2) a multilingual image-text model supporting these languages, enhancing cultural and linguistic comprehension in vision-language tasks. Code available at https://github.com/nahidalam/maya.
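The abstract says the eight-language pretraining dataset is derived from the LLaVA pretraining data, but the page does not describe the pipeline. One plausible sketch (the `translate` backend, record schema, and language codes are assumptions for illustration, not the paper's actual method) is to fan each English image-caption record out to every target language while keeping the image reference fixed:

```python
def expand_to_languages(samples, languages, translate):
    """Fan out (image_path, english_caption) records into one record
    per target language, preserving the image reference.

    `translate(text, lang)` is a stand-in for any machine-translation
    backend; English captions pass through unchanged.
    """
    expanded = []
    for image_path, caption in samples:
        for lang in languages:
            expanded.append({
                "image": image_path,
                "lang": lang,
                "caption": caption if lang == "en" else translate(caption, lang),
            })
    return expanded
```

Each image thus appears once per language, so a model trained on the expanded set sees the same visual content described in all eight languages.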
Problem

Research questions and friction points this paper is trying to address.

Addressing low-resource language performance in VLMs
Enhancing cultural context understanding in vision-language tasks
Developing a multilingual VLM with diverse language support
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual image-text pretraining dataset in eight languages
Open-source Multilingual Vision-Language Model (Maya)
Enhanced cultural and linguistic comprehension in vision tasks
👥 Authors
Nahid Alam
Karthik Reddy Kanjula
Surya Guthikonda
Timothy Chung
Bala Krishna S Vegesna
Abhipsha Das
Anthony Susevski
Ryan Sze-Yin Chan (The Alan Turing Institute)
S. M. I. Uddin
Shayekh Bin Islam
Roshan Santhosh
A Snegha
Drishti Sharma
Chen Liu
Isha Chaturvedi
Genta Indra Winata (Capital One AI Foundations)
Ashvanth.S
Snehanshu Mukherjee
Alham Fikri Aji (MBZUAI, Monash Indonesia)