🤖 AI Summary
Existing vision-language models (VLMs) generalize poorly in low-resource language and multicultural settings. To address this, we introduce Maya, an open-source multilingual VLM designed to enhance fine-grained cross-lingual and cross-cultural understanding. Methodologically: (1) we construct a multilingual image-text pretraining dataset covering eight languages; (2) we extend the LLaVA framework with multilingual instruction tuning, cross-lingual contrastive learning, and culture-aware image annotation; and (3) we design a lightweight unified architecture enabling joint cross-lingual and cross-cultural alignment. Experiments demonstrate that Maya significantly outperforms monolingual baselines on multilingual visual question answering, image captioning, and cross-modal retrieval tasks. Notably, it achieves over 32% absolute accuracy gains on low-resource languages, including Hindi and Arabic, highlighting its robustness in linguistically and culturally diverse scenarios.
📝 Abstract
Recent years have seen rapid development of large Vision-Language Models (VLMs). These models have shown impressive results on academic benchmarks, primarily in widely spoken languages, but they perform poorly on low-resource languages and in varied cultural contexts. To address these limitations, we introduce Maya, an open-source multilingual VLM. Our contributions are: 1) a multilingual image-text pretraining dataset in eight languages, based on the LLaVA pretraining dataset; and 2) a multilingual image-text model supporting these languages, enhancing cultural and linguistic comprehension in vision-language tasks. Code is available at https://github.com/nahidalam/maya.