BYOL: Bring Your Own Language Into LLMs

📅 2026-01-15
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the poor performance of large language models on low-resource and extremely low-resource languages, which stems from data scarcity and cultural misalignment. The authors propose the BYOL framework, which introduces a four-tier classification of languages based on their digital footprint and uses it to select an integration pathway per language. For low-resource languages, they design an end-to-end pipeline encompassing data cleaning, synthetic data generation, continued pretraining, and supervised fine-tuning, and apply weight-space model merging to balance local linguistic performance against multilingual generalization. For extremely low-resource languages, they adopt a translation-mediated adaptation pathway. Experiments demonstrate an average 12% performance gain on Chichewa and Māori and a 4-point BLEU improvement for Inuktitut translation. The study also releases human-translated Global MMLU-Lite benchmarks in all three languages, alongside open-sourced code and models.
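The weight-space model merging mentioned above can be made concrete with a minimal sketch: parameter-wise linear interpolation between a language-adapted checkpoint and its multilingual base. The function name, the dummy tensors, and the interpolation weight alpha below are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of weight-space model merging, assuming both models
# share the same architecture and parameter names. The interpolation
# weight alpha trades adapted-language skill against the base model's
# multilingual generalization; its value here is an assumption.
import torch

def merge_state_dicts(base_sd, adapted_sd, alpha=0.5):
    """Parameter-wise linear interpolation of two matching state dicts."""
    return {name: alpha * adapted_sd[name] + (1.0 - alpha) * param
            for name, param in base_sd.items()}

# Toy demonstration with dummy tensors; real use would torch.load two
# checkpoints' state dicts and torch.save the merged result.
base_sd = {"w": torch.zeros(2, 2)}
adapted_sd = {"w": torch.ones(2, 2)}
merged = merge_state_dicts(base_sd, adapted_sd, alpha=0.5)
print(merged["w"])  # every entry is 0.5: halfway between the two models
```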

📝 Abstract
Large Language Models (LLMs) exhibit strong multilingual capabilities, yet remain fundamentally constrained by the severe imbalance in global language resources. While over 7,000 languages are spoken worldwide, only a small subset (fewer than 100) has sufficient digital presence to meaningfully influence modern LLM training. This disparity leads to systematic underperformance, cultural misalignment, and limited accessibility for speakers of low-resource and extreme-low-resource languages. To address this gap, we introduce Bring Your Own Language (BYOL), a unified framework for scalable, language-aware LLM development tailored to each language's digital footprint. BYOL begins with a language resource classification that maps languages into four tiers (Extreme-Low, Low, Mid, High) using curated web-scale corpora, and uses this classification to select the appropriate integration pathway. For low-resource languages, we propose a full-stack data refinement and expansion pipeline that combines corpus cleaning, synthetic text generation, continual pretraining, and supervised finetuning. Applied to Chichewa and Maori, this pipeline yields language-specific LLMs that achieve approximately 12 percent average improvement over strong multilingual baselines across 12 benchmarks, while preserving English and multilingual capabilities via weight-space model merging. For extreme-low-resource languages, we introduce a translation-mediated inclusion pathway, and show on Inuktitut that a tailored machine translation system improves over a commercial baseline by 4 BLEU, enabling high-accuracy LLM access when direct language modeling is infeasible. Finally, we release human-translated versions of the Global MMLU-Lite benchmark in Chichewa, Maori, and Inuktitut, and make our codebase and models publicly available at https://github.com/microsoft/byol.
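As a rough illustration of the classification step the abstract describes, the sketch below maps a language to one of the four tiers from the token count of its curated corpus. The thresholds are invented for illustration; the paper's actual cutoffs are not stated here.

```python
# Hedged sketch of BYOL's four-tier resource classification
# (Extreme-Low, Low, Mid, High). The token-count thresholds are
# illustrative assumptions, not the paper's actual cutoffs.
def classify_language_tier(corpus_tokens: int) -> str:
    if corpus_tokens < 10_000_000:        # assumed Extreme-Low boundary
        return "Extreme-Low"
    if corpus_tokens < 1_000_000_000:     # assumed Low boundary
        return "Low"
    if corpus_tokens < 100_000_000_000:   # assumed Mid boundary
        return "Mid"
    return "High"

# The tier selects the integration pathway: an Extreme-Low language
# would take the translation-mediated inclusion route, a Low-resource
# one the data refinement and expansion pipeline.
print(classify_language_tier(5_000_000))  # -> Extreme-Low
```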
Problem

Research questions and friction points this paper is trying to address.

low-resource languages
language imbalance
multilingual LLMs
extreme-low-resource languages
language accessibility
Innovation

Methods, ideas, or system contributions that make the work stand out.

BYOL
low-resource languages
language-aware LLM
model merging
translation-mediated inclusion
Syed Waqas Zamir
Sr. Research Scientist @ Microsoft AI for Good Lab
Computer Vision, Generative AI, Low-level Vision, Deep Learning, Image Restoration
W. Hamidouche
Microsoft AI for Good Research Lab
B. Amor
Inception, G42
Luana Marotti
Microsoft AI for Good Research Lab
I. Becker-Reshef
Microsoft AI for Good Research Lab
J. L. Ferres
Microsoft AI for Good Research Lab