Marco-MoE: Open Multilingual Mixture-of-Expert Language Models with Efficient Upcycling

📅 2026-04-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of building high-performing multilingual sparse language models at scale without giving up computational efficiency. The authors propose Marco-MoE, a suite of fully open multilingual sparse Mixture-of-Experts (MoE) language models that activate only about 5% of their parameters per token. The models are upcycled from dense models, pre-trained on 5 trillion tokens, and then post-trained into instruction-tuned variants. Marco-MoE outperforms similarly sized models on both English and multilingual benchmarks, and its instruction-tuned variants surpass competing models that activate 3–14 times more parameters. The analysis finds that expert activation patterns are shared across related languages and highly specialized for linguistically isolated ones, and it shows that languages can be added at scale without the interference typical of dense models. The full training datasets, recipes, and model weights are publicly released.
📝 Abstract
We present Marco-MoE, a suite of fully open multilingual sparse Mixture-of-Experts (MoE) models. Marco-MoE features a highly sparse design in which only around 5% of the total parameters are activated per input token. This extreme sparsity, combined with upcycling from dense models, enables efficient pre-training on 5T tokens. Our models surpass similarly-sized competitors on English and multilingual benchmarks, achieving a best-in-class performance-to-compute ratio. We further post-train these models to create Marco-MoE-Instruct variants, which surpass the performance of competing models possessing 3–14× more activated parameters. Our analysis reveals that Marco-MoE learns structured expert activation patterns shared across related languages, while maintaining highly specialized utilization for linguistically isolated ones. We further show that Marco-MoE allows for scalable language expansion without the interference typical of dense models. To support the community, we disclose our full training datasets, recipes, and model weights.
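The abstract's two central mechanisms, upcycling from a dense model and activating only a small share of parameters per token, can be illustrated with a minimal sketch. It shows the generic sparse-upcycling and top-k routing ideas, not the Marco-MoE training recipe. The function names, the dictionary weight layout, and all sizes (128 experts, 4 active experts, the parameter counts) are hypothetical and were chosen only so the result lands near the 5% figure quoted above.

```python
import copy
import math


def upcycle(dense_ffn, num_experts):
    """Initialise every expert as an independent copy of the dense FFN weights.

    Assumption: this is plain weight copying, the simplest form of upcycling.
    """
    return [copy.deepcopy(dense_ffn) for _ in range(num_experts)]


def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are a softmax taken over the selected logits only.
    """
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    peak = max(router_logits[i] for i in top)  # subtract for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def activated_fraction(shared_params, expert_params, num_experts, k):
    """Fraction of all parameters that one token touches under top-k routing."""
    total = shared_params + num_experts * expert_params
    active = shared_params + k * expert_params
    return active / total


# Hypothetical toy weights standing in for a dense feed-forward block.
dense_ffn = {"w_in": [[1.0, 2.0]], "w_out": [[3.0]]}
experts = upcycle(dense_ffn, num_experts=128)

# Hypothetical sizes in arbitrary units, not taken from the paper.
fraction = activated_fraction(shared_params=10, expert_params=5,
                              num_experts=128, k=4)
print(f"{len(experts)} experts, activated fraction = {fraction:.3f}")
```

With these illustrative sizes a token touches 30 of 650 parameter units, a little under 5%. The fraction is driven by the ratio of active to total experts and by how many parameters sit outside the experts, such as attention and embeddings.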
Problem

Research questions and friction points this paper is trying to address.

multilingual
Mixture-of-Experts
sparse models
language expansion
efficient pre-training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts
sparse models
multilingual language models
upcycling
efficient pre-training