An Expanded Massive Multilingual Dataset for High-Performance Language Technologies

📅 2025-03-13
🤖 AI Summary
Constructing high-quality multilingual training datasets remains a critical bottleneck for large language models and machine translation, especially for low-resource languages suffering from data scarcity, high noise levels, and poor transparency. To address this, we introduce HPLT v2, an open-source, highly transparent, ultra-large-scale multilingual corpus covering 193 languages (8 trillion monolingual tokens) and 51 languages (380 million parallel sentence pairs). We design an end-to-end cleaning pipeline integrating strict deduplication, multi-granularity language identification, toxicity filtering, and factual consistency verification, validated via statistical analysis and human sampling. HPLT v2 substantially improves multilingual language modeling and advances machine translation performance: on WMT benchmarks, it yields an average +2.1 BLEU gain across 40+ low-resource languages. This work fills a crucial gap by providing a high-quality, reproducible, and auditable foundational multilingual dataset.
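To make the pipeline stages named above concrete, here is a minimal illustrative sketch of a monolingual cleaning pass combining exact deduplication with language-identification filtering. This is not the released HPLT v2 code; `lang_id` is a hypothetical callable standing in for whatever classifier the pipeline uses, and the confidence threshold is an assumed value.

```python
import hashlib


def clean_corpus(docs, lang_id, target_lang, min_confidence=0.9):
    """Illustrative cleaning pass: drop exact duplicates via content
    hashing, then keep only documents the language identifier assigns
    to target_lang with sufficient confidence.

    lang_id: callable mapping a document string to (language, confidence).
    """
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha1(doc.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of an earlier document: drop
        seen.add(digest)
        lang, conf = lang_id(doc)
        if lang == target_lang and conf >= min_confidence:
            kept.append(doc)
    return kept
```

A production pipeline would add fuzzy (near-duplicate) deduplication, paragraph-level language checks, and content filters on top of this skeleton.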

📝 Abstract
Training state-of-the-art large language models requires vast amounts of clean and diverse textual data. However, building suitable multilingual datasets remains a challenge. In this work, we present HPLT v2, a collection of high-quality multilingual monolingual and parallel corpora. The monolingual portion of the data contains 8T tokens covering 193 languages, while the parallel data contains 380M sentence pairs covering 51 languages. We document the entire data pipeline and release the code to reproduce it. We provide extensive analysis of the quality and characteristics of our data. Finally, we evaluate the performance of language models and machine translation systems trained on HPLT v2, demonstrating its value.
Problem

Research questions and friction points this paper is trying to address.

Addresses the challenge of building clean, diverse multilingual datasets for training language models.
Introduces HPLT v2, a high-quality multilingual dataset with extensive language coverage.
Evaluates language models and machine translation systems trained on HPLT v2.
Innovation

Methods, ideas, or system contributions that make the work stand out.

High-quality multilingual monolingual corpora
Extensive parallel corpora for 51 languages
Open-source data pipeline and reproducibility code
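As an illustration of the kind of heuristic commonly applied when cleaning parallel corpora like those above, here is a generic length-ratio filter. This is a standard alignment-sanity check, not necessarily the exact rule used in the HPLT v2 pipeline; the ratio bound is an assumed value.

```python
def length_ratio_filter(pairs, max_ratio=2.0, min_len=1):
    """Drop sentence pairs whose source/target token-length ratio is
    implausible, a cheap proxy for misaligned or truncated pairs.

    pairs: iterable of (source_sentence, target_sentence) strings.
    """
    kept = []
    for src, tgt in pairs:
        ls, lt = len(src.split()), len(tgt.split())
        if ls < min_len or lt < min_len:
            continue  # empty or near-empty side: drop
        if max(ls, lt) / min(ls, lt) <= max_ratio:
            kept.append((src, tgt))
    return kept
```

Real parallel-corpus cleaning typically layers further signals on top, such as bilingual sentence-embedding similarity and language-identification agreement on both sides.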
Laurie Burchell
University of Edinburgh
Ona de Gibert
PhD Student @ University of Helsinki
Machine Translation, Multilinguality, Knowledge Distillation
Nikolay Arefyev
Postdoctoral Research Fellow, Department of Informatics, University of Oslo
Natural Language Processing, Machine Learning, Artificial Intelligence, Word Sense Induction, Lexical Semantic Change Detection
Mikko Aulamo
Unknown affiliation
Marta Bañón
Prompsit Language Engineering
Pinzhen Chen
University of Edinburgh
large language models, LLM post-training, machine translation, multilinguality
Mariia Fedorova
University of Oslo
NLP
Liane Guillou
The University of Edinburgh
Machine Translation, Evaluation, Uncertainty
Barry Haddow
University of Edinburgh
NLP, machine translation, spoken language translation, information extraction
Jan Hajič
Charles University
Jindřich Helcl
Charles University
Erik Henriksson
Postdoctoral researcher, University of Turku
Mateusz Klimaszewski
PhD Student, Warsaw University of Technology
natural language processing, machine learning, machine translation
Ville Komulainen
University of Turku
Andrey Kutuzov
University of Oslo
Computational Linguistics, Natural Language Processing, Diachronic Word Embeddings, Semantic Change Detection, Machine Learning
Joona Kytöniemi
University of Turku
Veronika Laippala
University of Turku
Petter Mæhlum
UiO
language technology, typology, historical linguistics, sentiment analysis, annotation
Bhavitvya Malik
Research Assistant, University of Edinburgh
natural language processing, speech
Farrokh Mehryary
University of Turku
Natural Language Processing, text mining, deep learning, bioinformatics
Vladislav Mikhailov
University of Oslo
LLM, NLP, benchmarking
Nikita Moghe
Student, University of Edinburgh
Natural Language Processing, Machine Learning
Amanda Myntti
University of Turku
Dayyán O'Brien
University of Edinburgh
Natural language processing
Stephan Oepen
Professor in Language Technologies, Universitetet i Oslo
Human Language Technologies, Natural Language Processing, Computational Linguistics
Proyag Pal
University of Edinburgh
Jousia Piha
University of Turku
Sampo Pyysalo
University of Turku
Gema Ramírez-Sánchez
Prompsit Language Engineering
David Samuel
Language Technology Group, University of Oslo
language modeling, semantic parsing, natural language processing
Pavel Stepachev
University of Edinburgh
Jörg Tiedemann
Professor of Language Technology, University of Helsinki
computational linguistics, machine translation, machine learning, natural language processing, information retrieval
Dušan Variš
research assistant, Charles University, Institute of Formal and Applied Linguistics
deep learning, machine translation, natural language processing
Tereza Vojtěchová
Charles University
Jaume Zaragoza-Bernabeu
Prompsit Language Engineering