Aria: An Open Multimodal Native Mixture-of-Experts Model

📅 2024-10-08
🏛️ arXiv.org
📈 Citations: 13
Influential: 0
🤖 AI Summary
Existing open-source multimodal large language models (MLLMs) suffer from architectural rigidity and limited ability to integrate real-world multimodal information. To address these limitations, this work introduces the first natively multimodal Mixture-of-Experts (MoE) model with a full open-source release. The authors propose a four-stage progressive pretraining paradigm (language-only, vision-language joint, long-context, and instruction-tuning phases) in which sparse expert routing activates 3.9B parameters per visual token and 3.5B per text token, enabling efficient multimodal understanding. All model weights and adaptation code are publicly released. Extensive experiments demonstrate state-of-the-art performance among open models: the method surpasses Pixtral-12B and Llama-3.2-11B across multimodal understanding, language modeling, and code generation benchmarks, while matching the performance of leading proprietary models, thereby significantly advancing open multimodal AI research and deployment.
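The sparse activation described above can be illustrated with a toy Mixture-of-Experts layer. This is a minimal sketch, not Aria's actual architecture: the `SparseMoELayer` class, its sizes, and the tanh experts are all invented for illustration; only the top-k routing pattern (a router scores all experts per token, only k experts run, and their outputs are combined with renormalized gate weights) reflects the mechanism the summary describes.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class SparseMoELayer:
    """Toy sparse MoE feed-forward layer (hypothetical sizes, not Aria's)."""

    def __init__(self, dim=16, num_experts=8, top_k=2, seed=0):
        rng = np.random.default_rng(seed)
        self.router = rng.standard_normal((dim, num_experts))
        self.experts = [rng.standard_normal((dim, dim)) / np.sqrt(dim)
                        for _ in range(num_experts)]
        self.top_k = top_k

    def __call__(self, tokens):
        # tokens: (num_tokens, dim)
        gates = softmax(tokens @ self.router)            # (num_tokens, num_experts)
        out = np.zeros_like(tokens)
        for t, tok in enumerate(tokens):
            chosen = np.argsort(gates[t])[-self.top_k:]  # top-k expert indices
            weights = gates[t, chosen] / gates[t, chosen].sum()
            for w, e in zip(weights, chosen):            # only k experts execute
                out[t] += w * np.tanh(tok @ self.experts[e])
        return out
```

Because routing is decided per token, visual and text tokens can end up activating different experts, which is how a model like Aria can report distinct activated-parameter counts per modality (3.9B per visual token, 3.5B per text token).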

📝 Abstract
Information comes in diverse modalities. Multimodal native AI models are essential to integrate real-world information and deliver comprehensive understanding. While proprietary multimodal native models exist, their lack of openness imposes obstacles for adoption, let alone adaptation. To fill this gap, we introduce Aria, an open multimodal native model with best-in-class performance across a wide range of multimodal, language, and coding tasks. Aria is a mixture-of-experts model with 3.9B and 3.5B activated parameters per visual token and text token, respectively. It outperforms Pixtral-12B and Llama-3.2-11B, and is competitive against the best proprietary models on various multimodal tasks. We pre-train Aria from scratch following a 4-stage pipeline, which progressively equips the model with strong capabilities in language understanding, multimodal understanding, long context window, and instruction following. We open-source the model weights along with a codebase that facilitates easy adoption and adaptation of Aria in real-world applications.
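The 4-stage pipeline in the abstract can be sketched as a staged training schedule. This is a hypothetical illustration: the stage names follow the abstract's description, but the data mixes, context lengths, and the `train_stage` interface are invented, not taken from the paper.

```python
# Hypothetical 4-stage progressive pretraining schedule (illustrative values).
STAGES = [
    {"name": "language_pretraining",   "modalities": ["text"],          "context_len": 8192},
    {"name": "multimodal_pretraining", "modalities": ["text", "image"], "context_len": 8192},
    {"name": "long_context",           "modalities": ["text", "image"], "context_len": 65536},
    {"name": "instruction_tuning",     "modalities": ["text", "image"], "context_len": 65536},
]

def run_pipeline(train_stage, stages=STAGES):
    """Run each stage in order, threading the checkpoint forward so every
    stage builds on the capabilities acquired by the previous one."""
    checkpoint = None
    for stage in stages:
        checkpoint = train_stage(checkpoint, stage)
    return checkpoint
```

The key design choice the abstract highlights is progressiveness: each stage starts from the previous stage's checkpoint rather than training capabilities jointly from scratch.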
Problem

Research questions and friction points this paper is trying to address.

Multimodal AI Models
Transparency and Accessibility
Real-world Information Integration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Mixture-of-Experts Model
Strong Multimodal Understanding
Open Weights and Adaptation Code