ÜberWeb: Insights from Multilingual Curation for a 20-Trillion-Token Dataset

📅 2026-02-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the performance degradation commonly observed in multilingual large language models, often attributed to imbalanced data distributions and the so-called "curse of multilinguality." The authors identify the root cause as remediable corpus-quality issues and propose a language-specific data curation and balancing strategy. By integrating multilingual quality evaluation with an efficient training-mixture methodology, they optimize the composition of a 20-trillion-token corpus. Models trained on this refined dataset, specifically 3B- and 8B-parameter variants, achieve state-of-the-art multilingual performance while using 4–10 times fewer FLOPs than competing approaches. Furthermore, the curated corpus significantly enhances the multilingual scaling efficiency of Trinity Large (400B/A13B), demonstrating its effectiveness in improving both model performance and training efficiency across diverse languages.

📝 Abstract
Multilinguality is a core capability for modern foundation models, yet training high-quality multilingual models remains challenging due to uneven data availability across languages. A further challenge is the performance interference that can arise from joint multilingual training, commonly referred to as the "curse of multilinguality". We study multilingual data curation across thirteen languages and find that many reported regressions are not inherent to multilingual scaling: they stem from correctable deficiencies in data quality and composition rather than fundamental capacity limits. In controlled bilingual experiments, improving data quality for any single language benefits others: curating English improves non-English performance in 12 of 13 languages, while curating non-English data yields reciprocal improvements in English. Bespoke per-language curation produces substantially larger within-language improvements. Extending these findings to large-scale general-purpose training mixtures, we show that curated multilingual allocations comprising under 8% of total tokens remain remarkably effective. We operationalize this approach within an effort that produced a 20T-token pretraining corpus derived entirely from public sources. Models with 3B and 8B parameters trained on a 1T-token random subset achieve competitive multilingual accuracy with 4-10x fewer training FLOPs than strong public baselines, establishing a new Pareto frontier in multilingual performance versus compute. Moreover, these benefits extend to frontier-model scale: the 20T-token corpus served as part of the pretraining dataset for Trinity Large (400B/A13B), which exhibits strong multilingual performance relative to its training FLOPs. These results show that targeted, per-language data curation mitigates multilingual interference and enables compute-efficient multilingual scaling.
Problem

Research questions and friction points this paper is trying to address.

multilinguality · data curation · curse of multilinguality · training data imbalance · multilingual interference
Innovation

Methods, ideas, or system contributions that make the work stand out.

multilingual data curation · curse of multilinguality · compute-efficient scaling · pretraining corpus · language-specific optimization
👥 Authors

Aldo Gael Carranza
Kaleigh Mentzer
Ricardo Pio Monti (Gatsby Unit, UCL)
Alex Fang
Alvin Deng
Amro Abbas (DatologyAI)
Anshuman Suri (Northeastern University)
Brett Larsen
Cody Blakeney
Darren Teh
David Schwab
Diego Kiner
Fan Pan
Haakon Mongstad
Jack Urbanek (DatologyAI)
Jason Lee
Jason Telanoff
Josh Wills
Luke Merrick
Parth Doshi (University of California San Diego)
Paul Burstein
Pratyush Maini (Carnegie Mellon University)
Spandan Das