Aya Vision: Advancing the Frontier of Multilingual Multimodality

📅 2025-05-13
📈 Citations: 1
Influential citations: 0
🤖 AI Summary
Multilingual multimodal large language models (MLLMs) face three key challenges: (1) difficulty in cross-lingual vision–language alignment, (2) scarcity of high-quality multilingual multimodal instruction data, and (3) degradation of text-only capabilities once the visual modality is integrated. To address these, we propose: (1) a synthetic annotation framework that generates high-fidelity multilingual multimodal instruction data; (2) a cross-modal model merging technique that enhances visual understanding while explicitly preserving text-only reasoning capabilities; and (3) a two-stage training paradigm that combines multilingual vision–language alignment pretraining with capability-aware instruction fine-tuning. Experiments show that Aya-Vision-8B outperforms Qwen-2.5-VL-7B, while Aya-Vision-32B surpasses models more than twice its size, including Molmo-72B and LLaMA-3.2-90B-Vision, achieving more consistent cross-lingual multimodal comprehension and significantly mitigating catastrophic forgetting.
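
As a rough illustration of the two-stage training paradigm mentioned above, the sketch below keeps the language model frozen during vision–language alignment pretraining and unfreezes it only for instruction fine-tuning. This is a minimal sketch under assumed module names (vision_encoder, connector, language_model), step counts, and learning rates; it is not the paper's actual implementation.

```python
# Minimal sketch of a two-stage multimodal training schedule (PyTorch).
# Module names, dataloader format, step counts, and learning rates are
# illustrative assumptions, not Aya Vision's actual training code.
import torch


def set_trainable(module: torch.nn.Module, trainable: bool) -> None:
    """Freeze or unfreeze every parameter of a sub-module."""
    for p in module.parameters():
        p.requires_grad = trainable


def train(model, dataloader, steps: int, lr: float) -> None:
    """Train on dict-style batches; assumes an HF-style forward returning .loss."""
    params = [p for p in model.parameters() if p.requires_grad]
    optim = torch.optim.AdamW(params, lr=lr)
    for step, batch in enumerate(dataloader):
        if step >= steps:
            break
        loss = model(**batch).loss
        loss.backward()
        optim.step()
        optim.zero_grad()


def two_stage_training(model, alignment_data, instruction_data) -> None:
    # Stage 1: vision-language alignment pretraining. Only the connector that
    # maps vision features into the LM embedding space is updated; the
    # language model stays frozen to protect its text-only capabilities.
    set_trainable(model.vision_encoder, False)
    set_trainable(model.language_model, False)
    set_trainable(model.connector, True)
    train(model, alignment_data, steps=10_000, lr=1e-3)

    # Stage 2: multimodal instruction fine-tuning. The language model is
    # unfrozen and tuned on curated multilingual multimodal instruction data
    # at a lower learning rate.
    set_trainable(model.language_model, True)
    train(model, instruction_data, steps=5_000, lr=2e-5)
```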

📝 Abstract
Building multimodal language models is fundamentally challenging: it requires aligning vision and language modalities, curating high-quality instruction data, and avoiding the degradation of existing text-only capabilities once vision is introduced. These difficulties are further magnified in the multilingual setting, where the need for multimodal data in different languages exacerbates existing data scarcity, machine translation often distorts meaning, and catastrophic forgetting is more pronounced. To address the aforementioned challenges, we introduce novel techniques spanning both data and modeling. First, we develop a synthetic annotation framework that curates high-quality, diverse multilingual multimodal instruction data, enabling Aya Vision models to produce natural, human-preferred responses to multimodal inputs across many languages. Complementing this, we propose a cross-modal model merging technique that mitigates catastrophic forgetting, effectively preserving text-only capabilities while simultaneously enhancing multimodal generative performance. Aya-Vision-8B achieves best-in-class performance compared to strong multimodal models such as Qwen-2.5-VL-7B, Pixtral-12B, and even much larger Llama-3.2-90B-Vision. We further scale this approach with Aya-Vision-32B, which outperforms models more than twice its size, such as Molmo-72B and LLaMA-3.2-90B-Vision. Our work advances multilingual progress on the multi-modal frontier, and provides insights into techniques that effectively bend the need for compute while delivering extremely high performance.
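
To make the model-merging idea concrete, here is a minimal sketch, assuming the merge is a simple linear interpolation over the language-model backbone's weights. The state-dict layout, the mixing coefficient alpha, and the function name merge_backbones are assumptions for illustration, not the exact recipe used for Aya Vision.

```python
# Hypothetical weight-space merging of a text-only LM backbone with the
# LM backbone of a multimodally fine-tuned checkpoint, to retain text-only
# capabilities while keeping the new multimodal skills.
from typing import Dict

import torch


def merge_backbones(
    text_only_sd: Dict[str, torch.Tensor],
    multimodal_sd: Dict[str, torch.Tensor],
    alpha: float = 0.5,
) -> Dict[str, torch.Tensor]:
    """Return alpha * multimodal + (1 - alpha) * text_only for matching tensors."""
    merged = {}
    for name, w_mm in multimodal_sd.items():
        w_txt = text_only_sd.get(name)
        if w_txt is not None and w_txt.shape == w_mm.shape:
            merged[name] = alpha * w_mm + (1.0 - alpha) * w_txt
        else:
            # Parameters without a text-only counterpart (e.g. a vision
            # connector) are copied from the multimodal checkpoint unchanged.
            merged[name] = w_mm
    return merged


# Usage sketch: merge the two state dicts and load the result back into the
# multimodal model's language backbone.
# merged = merge_backbones(text_lm.state_dict(), vision_lm.state_dict(), alpha=0.6)
# vision_lm.load_state_dict(merged)
```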
Problem

Research questions and friction points this paper is trying to address.

Aligning vision and language modalities in multilingual models
Mitigating catastrophic forgetting in multimodal multilingual settings
Addressing data scarcity for multilingual multimodal instruction data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthetic annotation framework for multilingual multimodal data (a sketch follows after this list)
Cross-modal model merging to prevent catastrophic forgetting
Scalable approach outperforming larger models efficiently
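
A minimal sketch of what a synthetic annotation pipeline of this kind could look like, assuming a translate-then-rephrase flow: an English prompt/response pair is synthesized per image, machine-translated into each target language, and then rephrased to repair translation artifacts. The helper callables (generate_english_annotation, translate, rephrase) are hypothetical stand-ins for a teacher VLM, an MT system, and a rephrasing LLM; they are not the paper's actual components.

```python
# Hypothetical translate-then-rephrase pipeline for multilingual multimodal
# instruction data. The callables passed in are illustrative placeholders.
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Example:
    image_path: str
    language: str
    prompt: str
    response: str


def build_multilingual_examples(
    image_paths: List[str],
    languages: List[str],
    generate_english_annotation: Callable[[str], Tuple[str, str]],
    translate: Callable[[str, str], str],
    rephrase: Callable[[str, str], str],
) -> List[Example]:
    examples: List[Example] = []
    for path in image_paths:
        # 1. Synthesize a high-quality English prompt/response pair for the image.
        prompt_en, response_en = generate_english_annotation(path)
        for lang in languages:
            # 2. Machine-translate both sides into the target language.
            prompt = translate(prompt_en, lang)
            response = translate(response_en, lang)
            # 3. Rephrase the translated response so it reads as natural,
            #    fluent text in the target language rather than translationese.
            response = rephrase(response, lang)
            examples.append(Example(path, lang, prompt, response))
    return examples
```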