BYOL: Bring Your Own Language Into LLMs

📅 2026-01-15
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the poor performance of large language models on low-resource and extremely low-resource languages, which stems from data scarcity and cultural misalignment. The authors propose the BYOL framework, which introduces a four-tier classification of languages based on their digital footprint and uses it to select an integration pathway per language. For low-resource languages, they design an end-to-end pipeline encompassing data cleaning, synthetic data generation, continued pretraining, and supervised fine-tuning, and apply weight-space model merging to balance local linguistic performance against multilingual generalization. For extremely low-resource languages, they adopt a translation-mediated adaptation pathway. Experiments demonstrate an average 12% performance gain on Chichewa and Māori and a 4-point BLEU improvement for Inuktitut translation. The study also releases human-translated Global MMLU-Lite benchmarks in all three languages, alongside open-sourced code and models.
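The weight-space model merging mentioned above can be made concrete with a minimal sketch: parameter-wise linear interpolation between a language-adapted checkpoint and its multilingual base. The function name, the dummy tensors, and the interpolation weight alpha below are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of weight-space model merging, assuming both models
# share the same architecture and parameter names. The interpolation
# weight alpha trades adapted-language skill against the base model's
# multilingual generalization; its value here is an assumption.
import torch

def merge_state_dicts(base_sd, adapted_sd, alpha=0.5):
    """Parameter-wise linear interpolation of two matching state dicts."""
    return {name: alpha * adapted_sd[name] + (1.0 - alpha) * param
            for name, param in base_sd.items()}

# Toy demonstration with dummy tensors; real use would torch.load two
# checkpoints' state dicts and torch.save the merged result.
base_sd = {"w": torch.zeros(2, 2)}
adapted_sd = {"w": torch.ones(2, 2)}
merged = merge_state_dicts(base_sd, adapted_sd, alpha=0.5)
print(merged["w"])  # every entry is 0.5: halfway between the two models
```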

📝 Abstract
Large Language Models (LLMs) exhibit strong multilingual capabilities, yet remain fundamentally constrained by the severe imbalance in global language resources. While over 7,000 languages are spoken worldwide, only a small subset (fewer than 100) has sufficient digital presence to meaningfully influence modern LLM training. This disparity leads to systematic underperformance, cultural misalignment, and limited accessibility for speakers of low-resource and extreme-low-resource languages. To address this gap, we introduce Bring Your Own Language (BYOL), a unified framework for scalable, language-aware LLM development tailored to each language's digital footprint. BYOL begins with a language resource classification that maps languages into four tiers (Extreme-Low, Low, Mid, High) using curated web-scale corpora, and uses this classification to select the appropriate integration pathway. For low-resource languages, we propose a full-stack data refinement and expansion pipeline that combines corpus cleaning, synthetic text generation, continual pretraining, and supervised finetuning. Applied to Chichewa and Maori, this pipeline yields language-specific LLMs that achieve approximately 12 percent average improvement over strong multilingual baselines across 12 benchmarks, while preserving English and multilingual capabilities via weight-space model merging. For extreme-low-resource languages, we introduce a translation-mediated inclusion pathway, and show on Inuktitut that a tailored machine translation system improves over a commercial baseline by 4 BLEU, enabling high-accuracy LLM access when direct language modeling is infeasible. Finally, we release human-translated versions of the Global MMLU-Lite benchmark in Chichewa, Maori, and Inuktitut, and make our codebase and models publicly available at https://github.com/microsoft/byol.
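As a rough illustration of the classification step the abstract describes, the sketch below maps a language to one of the four tiers from the token count of its curated corpus. The thresholds are invented for illustration; the paper's actual cutoffs are not stated here.

```python
# Hedged sketch of BYOL's four-tier resource classification
# (Extreme-Low, Low, Mid, High). The token-count thresholds are
# illustrative assumptions, not the paper's actual cutoffs.
def classify_language_tier(corpus_tokens: int) -> str:
    if corpus_tokens < 10_000_000:        # assumed Extreme-Low boundary
        return "Extreme-Low"
    if corpus_tokens < 1_000_000_000:     # assumed Low boundary
        return "Low"
    if corpus_tokens < 100_000_000_000:   # assumed Mid boundary
        return "Mid"
    return "High"

# The tier selects the integration pathway: an Extreme-Low language
# would take the translation-mediated inclusion route, a Low-resource
# one the data refinement and expansion pipeline.
print(classify_language_tier(5_000_000))  # -> Extreme-Low
```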
Problem

Research questions and friction points this paper is trying to address.

low-resource languages
language imbalance
multilingual LLMs
extreme-low-resource languages
language accessibility
Innovation

Methods, ideas, or system contributions that make the work stand out.

BYOL
low-resource languages
language-aware LLM
model merging
translation-mediated inclusion
Syed Waqas Zamir
Sr. Research Scientist @ Microsoft AI for Good Lab
Computer Vision, Generative AI, Low-level Vision, Deep Learning, Image Restoration
W. Hamidouche
Microsoft AI for Good Research Lab
B. Amor
Inception, G42
Luana Marotti
Microsoft AI for Good Research Lab
I. Becker-Reshef
Microsoft AI for Good Research Lab
J. L. Ferres
Microsoft AI for Good Research Lab