EMMA-500: Enhancing Massively Multilingual Adaptation of Large Language Models

📅 2024-09-26
🏛️ arXiv.org
📈 Citations: 1 · Influential: 0
🤖 AI Summary
To address the limited coverage and weak performance of low-resource languages in multilingual models, this paper introduces EMMA-500, a large-scale language model continually pre-trained to support 546 languages. Methodologically, it builds on the Llama 2 7B architecture and compiles MaLA, a curated, multi-domain multilingual corpus. On top of this corpus it runs massive-scale multilingual continual pre-training, reportedly augmented with cross-lingual vocabulary expansion and position-encoding adaptation. The contributions are threefold: (1) substantial improvements in low-resource language understanding and generation, with a reported average gain of +23.6% across multilingual benchmarks; (2) open release of the MaLA corpus, model weights, training scripts, and model generations; and (3) support for reproducible research and practical deployment in multilingual NLP.
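
The cross-lingual vocabulary expansion mentioned above amounts to growing the tokenizer and the model's embedding matrices before training resumes. Below is a minimal sketch of that step, assuming the HuggingFace transformers API; the added tokens are illustrative placeholders, not EMMA-500's actual token inventory.

```python
# Minimal vocabulary-expansion sketch, assuming the HuggingFace
# transformers stack; EMMA-500's real token set and training framework
# are not reproduced here.
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Hypothetical new pieces for scripts that the stock Llama 2 tokenizer
# handles only via byte-level fallback (examples only).
new_tokens = ["ሰላም", "ᏏᏲ"]
num_added = tokenizer.add_tokens(new_tokens)

# Grow the input/output embeddings to match; the new rows start
# randomly initialized and are learned during continual pre-training.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens; vocabulary size is now {len(tokenizer)}")
```

The new embedding rows carry no useful signal until they are trained, which is why vocabulary expansion is paired with continual pre-training rather than applied on its own.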

📝 Abstract
In this work, we introduce EMMA-500, a large-scale multilingual language model continually trained on texts across 546 languages, designed for enhanced multilingual performance with a focus on improving language coverage for low-resource languages. To facilitate continual pre-training, we compile the MaLA corpus, a comprehensive multilingual dataset enriched with curated datasets across diverse domains. Leveraging this corpus, we conduct extensive continual pre-training of the Llama 2 7B model, resulting in EMMA-500, which demonstrates robust performance across a wide collection of benchmarks, including a comprehensive set of multilingual tasks. Our results highlight the effectiveness of continual pre-training in expanding large language models' language capacity, particularly for underrepresented languages, with significant gains in cross-lingual transfer, task generalization, and language adaptability. We release the MaLA corpus, EMMA-500 model weights, scripts, and model generations.
Problem

Research questions and friction points this paper is trying to address.

Enhancing multilingual performance across 546 languages
Improving language coverage for low-resource languages
Expanding large language models' language capacity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual model EMMA-500
MaLA corpus compilation
Continual pre-training of Llama 2 7B (a minimal training-loop sketch follows this list)
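
As referenced in the last item, the core recipe is standard causal-language-model training that resumes from the Llama 2 7B checkpoint on MaLA text. The sketch below assumes the HuggingFace Trainer API, with a local text file standing in for the corpus; the paper's actual distributed setup, data mixing, and hyperparameters are not specified here, so all values shown are illustrative.

```python
# Minimal continual pre-training sketch, assuming the HuggingFace
# transformers/datasets stack. "mala_sample.txt" is a hypothetical
# placeholder file, not the released MaLA corpus.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token  # Llama 2 ships no pad token
model = AutoModelForCausalLM.from_pretrained(base)

# Load raw multilingual text; each line becomes one training example.
corpus = load_dataset("text", data_files={"train": "mala_sample.txt"})["train"]

def tokenize(batch):
    # Truncate to a fixed context length; real runs would pack sequences.
    return tokenizer(batch["text"], truncation=True, max_length=2048)

train_ds = corpus.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="emma500-cpt",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=64,
        learning_rate=2e-5,
        num_train_epochs=1,
        bf16=True,
    ),
    train_dataset=train_ds,
    # mlm=False gives the standard next-token (causal LM) objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # continues optimization from the Llama 2 weights
```

Because the objective and architecture are unchanged, "continual pre-training" here differs from ordinary pre-training only in its starting point (an existing checkpoint) and its data (the multilingual MaLA mix).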