Omnilingual MT: Machine Translation for 1,600 Languages

📅 2026-03-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the gap between the limited language coverage of current machine translation systems (around 200 languages) and the more than 7,000 languages spoken worldwide, a problem compounded by the absence of reliable evaluation benchmarks. The authors present a high-quality translation system supporting more than 1,600 languages, integrating large-scale public corpora with MeDLEY, a newly created, manually curated bitext dataset. They explore two ways of specializing a large language model (LLM) for translation: a decoder-only variant (OMT-LLaMA) and a variant in which the LLM serves as a module in an encoder-decoder architecture (OMT-NLLB). At 1B to 8B parameters, these models match or exceed the translation performance of a 70B-parameter general-purpose LLM baseline, markedly improving generation coherence and cross-lingual understanding for low-resource languages. Additionally, the study introduces BOUQuET and Met-BOUQuET, dynamically evolving evaluation benchmarks that establish a new paradigm for multilingual translation research.
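
To make the two specialization routes concrete, here is a minimal sketch contrasting them with Hugging Face transformers. The OMT checkpoints are not assumed to be public: "omt-llama-1b" below is a hypothetical placeholder identifier, and the public NLLB-200 distilled checkpoint stands in for the encoder-decoder route.

```python
# Sketch of the two LLM-specialization routes described above.
# "omt-llama-1b" is a hypothetical placeholder, not a released checkpoint;
# facebook/nllb-200-distilled-600M is a real public model used as a stand-in.
from transformers import AutoModelForCausalLM, AutoModelForSeq2SeqLM, AutoTokenizer

SRC = "The weather is lovely today."

# --- Encoder-decoder route (OMT-NLLB style) ---
enc_dec_id = "facebook/nllb-200-distilled-600M"
nllb_tok = AutoTokenizer.from_pretrained(enc_dec_id, src_lang="eng_Latn")
nllb = AutoModelForSeq2SeqLM.from_pretrained(enc_dec_id)

batch = nllb_tok(SRC, return_tensors="pt")
out = nllb.generate(
    **batch,
    # Force the first decoder token to the target-language code (here Asturian),
    # which is how NLLB-style models select the output language.
    forced_bos_token_id=nllb_tok.convert_tokens_to_ids("ast_Latn"),
    max_new_tokens=64,
)
print(nllb_tok.batch_decode(out, skip_special_tokens=True)[0])

# --- Decoder-only route (OMT-LLaMA style): translation as prompted generation ---
dec_id = "omt-llama-1b"  # hypothetical identifier
llm_tok = AutoTokenizer.from_pretrained(dec_id)
llm = AutoModelForCausalLM.from_pretrained(dec_id)

prompt = f"Translate from English to Asturian.\nEnglish: {SRC}\nAsturian:"
gen = llm.generate(**llm_tok(prompt, return_tensors="pt"), max_new_tokens=64)
print(llm_tok.decode(gen[0], skip_special_tokens=True))
```

The key design difference is visible in the decoding step: the encoder-decoder model selects the target language through a forced start-of-sequence token, while the decoder-only model relies entirely on the prompt to steer generation.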

📝 Abstract
High-quality machine translation (MT) can scale to hundreds of languages, setting a high bar for multilingual systems. However, compared to the world's 7,000 languages, current systems still offer only limited coverage: about 200 languages on the target side, and perhaps a few hundred more on the source side, supported through cross-lingual transfer. Even these numbers have been hard to evaluate due to the lack of reliable benchmarks and metrics. We present Omnilingual Machine Translation (OMT), the first MT system supporting more than 1,600 languages. This scale is enabled by a comprehensive data strategy that integrates large public multilingual corpora with newly created datasets, including manually curated MeDLEY bitext. We explore two ways of specializing a Large Language Model (LLM) for machine translation: as a decoder-only model (OMT-LLaMA) or as a module in an encoder-decoder architecture (OMT-NLLB). Notably, all our 1B to 8B parameter models match or exceed the MT performance of a 70B LLM baseline, revealing a clear specialization advantage and enabling strong translation quality in low-compute settings. Moreover, our evaluation of English-to-1,600 translations shows that while baseline models can interpret undersupported languages, they frequently fail to generate them with meaningful fidelity; OMT-LLaMA models substantially expand the set of languages for which coherent generation is feasible. OMT models also improve in cross-lingual transfer, coming close to solving the "understanding" part of the MT puzzle for the 1,600 languages evaluated. Our leaderboard and main human-created evaluation datasets (BOUQuET and Met-BOUQuET) are dynamically evolving towards omnilinguality and are freely available.
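
As a rough illustration of how system output on benchmarks like these is typically scored, the snippet below computes chrF++ with the sacrebleu library. The hypothesis and reference strings are invented, and this is only the generic scoring workflow, not BOUQuET's own evaluation harness.

```python
# Illustrative scoring sketch: chrF++ is a common metric for low-resource MT
# because it operates on character n-grams rather than whole words.
# The sentences below are made up for demonstration.
import sacrebleu

hypotheses = ["El tiempu ta guapu güei."]        # system outputs, one per segment
references = [["Güei el tiempu ta perguapu."]]   # one reference stream

# word_order=2 extends chrF with word bigrams, i.e. chrF++.
score = sacrebleu.corpus_chrf(hypotheses, references, word_order=2)
print(f"chrF++ = {score.score:.1f}")
```
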
Problem

Research questions and friction points this paper is trying to address.

machine translation
low-resource languages
language coverage
multilingual systems
evaluation benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Omnilingual Machine Translation
low-resource languages
model specialization
cross-lingual transfer
multilingual NMT
🔎 Similar Papers
No similar papers found.
Omnilingual MT Team
FAIR at Meta
Belen Alastruey
Meta AI
Natural Language Processing · Speech Processing · Interpretability
Niyati Bafna
Johns Hopkins University, Center for Language and Speech Processing
Low-resource NLP · Large Language Modelling · Machine Translation · Bilingual Lexicon Induction
Andrea Caciolai
FAIR at Meta
Kevin Heffernan
FAIR at Meta
Artyom Kozhevnikov
FAIR at Meta
Christophe Ropers
FAIR at Meta
Eduardo Sánchez
FAIR at Meta
Charles-Eric Saint-James
FAIR at Meta
Ioannis Tsiamas
UPC Barcelona & Meta FAIR
multilinguality · multimodality · speech translation · machine translation
Chierh Cheng
FAIR at Meta
Joe Chuang
FAIR at Meta
Paul-Ambroise Duquenne
Meta AI, FAIR
NLP · Speech Processing · Speech Translation · Machine Translation · Machine Learning
Mark Duppenthaler
FAIR at Meta
Nate Ekberg
FAIR at Meta
Cynthia Gao
FAIR at Meta
Pere Lluís Huguet Cabot
FAIR at Meta
João Maria Janeiro
FAIR at Meta
Jean Maillard
Meta AI
Natural Language Processing · Computational Linguistics · Machine Learning · Deep Learning
Gabriel Mejia Gonzalez
FAIR at Meta
Holger Schwenk
Research scientist, Facebook, and professor of Computer Science
NLP · statistical machine translation · machine learning · neural networks
Edan Toledo
Meta & UCL
Reinforcement Learning · Natural Language Processing · Multi Agent Reinforcement Learning
Arina Turkatenko
FAIR at Meta
Albert Ventayol-Boada
Meta AI
linguistics · NLP · computational methods · discourse and grammar · language description
Rashel Moritz
FAIR at Meta