Poro 34B and the Blessing of Multilinguality

📅 2024-04-02
🏛️ arXiv.org
📈 Citations: 14
Influential: 0
🤖 AI Summary
To address the scarcity of pretraining data for low-resource languages, this work treats multilinguality as a blessing rather than a curse and introduces Poro 34B, a 34-billion-parameter large language model pretrained jointly on Finnish (low-resource), English (high-resource), and programming-language corpora totaling one trillion tokens. Departing from the prevailing focus on monolingual training for individual large languages, the authors show that multilingual co-training can improve understanding and generation in the low-resource language as well as cross-lingual translation without sacrificing general-purpose performance. Experiments demonstrate that Poro 34B substantially outperforms existing models on Finnish benchmarks, achieves high-quality machine translation, and remains competitive in its class at generating English and code. To support reproducibility and further work, the authors release the model weights, scripts, and data under open licenses.

📝 Abstract
The pretraining of state-of-the-art large language models now requires trillions of words of text, which is orders of magnitude more than available for the vast majority of languages. While including text in more than one language is an obvious way to acquire more pretraining data, multilinguality is often seen as a curse, and most model training efforts continue to focus near-exclusively on individual large languages. We believe that multilinguality can be a blessing and that it should be possible to substantially improve over the capabilities of monolingual models for small languages through multilingual training. In this study, we introduce Poro 34B, a 34 billion parameter model trained for 1 trillion tokens of Finnish, English, and programming languages, and demonstrate that a multilingual training approach can produce a model that not only substantially advances over the capabilities of existing models for Finnish, but also excels in translation and is competitive in its class in generating English and programming languages. We release the model parameters, scripts, and data under open licenses at https://huggingface.co/LumiOpen/Poro-34B.
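Since the abstract points to the released checkpoint at https://huggingface.co/LumiOpen/Poro-34B, a minimal usage sketch follows. It assumes the weights load through the standard Hugging Face transformers AutoModelForCausalLM interface and that enough accelerator memory is available for a 34B-parameter model; the prompt and sampling settings are illustrative and not taken from the paper.

# Minimal sketch (not from the paper): load the released Poro-34B weights and
# sample a Finnish continuation. device_map="auto" requires the accelerate
# package and shards the model across available GPUs or offloads to CPU.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "LumiOpen/Poro-34B"  # repository named in the abstract

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # roughly halves memory vs. float32
    device_map="auto",
)

prompt = "Suomen suurin kaupunki on"  # Finnish: "The largest city in Finland is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=40,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))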
Problem

Research questions and friction points this paper is trying to address.

Scarcity of pretraining data for the vast majority of the world's languages
Whether multilingual training can strengthen, rather than dilute, a model's capabilities for a low-resource language
Whether a single model can substantially advance Finnish and translation while staying competitive in English and code generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Joint multilingual pretraining on Finnish, English, and programming languages (see the sketch after this list)
34B-parameter model trained on 1 trillion tokens
Open release of model weights, scripts, and training data
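The first item above describes joint pretraining on a mixture of Finnish, English, and code. The sketch below illustrates one generic way such a mixture could be sampled during data loading; the file paths and sampling weights are purely hypothetical and are not the ratios used for Poro 34B.

# Hypothetical sketch of language-weighted sampling for mixed pretraining data.
# Corpus paths and weights are illustrative assumptions, not the paper's mixture.
import random
from typing import Dict, Iterator, List

def stream_documents(path: str) -> Iterator[str]:
    """Repeatedly stream documents (one per line) from a corpus file,
    cycling so that smaller corpora can be seen more than once."""
    while True:
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                yield line.rstrip("\n")

# Assumed sources: a low-resource language, a high-resource language, and code.
streams: Dict[str, Iterator[str]] = {
    "finnish": stream_documents("data/finnish.txt"),
    "english": stream_documents("data/english.txt"),
    "code": stream_documents("data/code.txt"),
}

# Illustrative sampling weights only.
weights: Dict[str, float] = {"finnish": 0.3, "english": 0.5, "code": 0.2}

def mixed_batch(batch_size: int = 8) -> List[str]:
    """Draw one batch whose documents follow the source weights in expectation."""
    names = list(streams)
    picks = random.choices(names, weights=[weights[n] for n in names], k=batch_size)
    return [next(streams[name]) for name in picks]

if __name__ == "__main__":
    batch = mixed_batch()
    print(f"Sampled {len(batch)} documents with target mix {weights}")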
Authors
Risto Luukkonen
TurkuNLP Group, University of Turku; Silo AI
Jonathan Burdge
Silo AI
Elaine Zosa
Silo AI
Aarne Talman
University of Helsinki
Ville Komulainen
TurkuNLP Group, University of Turku
Väinö Hatanpää
CSC – IT Center for Science
Peter Sarlin
Silo AI; Aalto University
Sampo Pyysalo
TurkuNLP Group, University of Turku