🤖 AI Summary
Constructing high-quality multilingual training datasets remains a critical bottleneck for large language models and machine translation, especially for low-resource languages affected by data scarcity and noise. To address this, the authors introduce HPLT v2, a large-scale collection of multilingual monolingual and parallel corpora: the monolingual portion spans 193 languages (8 trillion tokens), and the parallel portion contains 380 million sentence pairs covering 51 languages. The entire data pipeline — including steps such as deduplication, language identification, and quality filtering — is documented, and the code to reproduce it is released. The authors provide extensive analysis of the data's quality and characteristics, and demonstrate its value by evaluating language models and machine translation systems trained on it. The result is a high-quality, reproducible, and auditable foundational multilingual dataset.
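To make the pipeline steps named above concrete, here is a minimal sketch of two of them — exact deduplication and heuristic noise filtering. This is an illustration under assumptions, not the authors' released code: the function name, thresholds, and the regex-based noise proxy are all hypothetical, and production web-scale pipelines typically add fuzzy (e.g. MinHash-based) deduplication and model-based language identification on top.

```python
import hashlib
import re

def clean_corpus(docs, min_chars=200, max_symbol_ratio=0.3):
    """Illustrative cleaning pass: exact deduplication plus simple
    heuristic filters. Thresholds here are arbitrary placeholders."""
    seen = set()
    kept = []
    for text in docs:
        text = text.strip()
        if len(text) < min_chars:
            continue  # too short to be useful training text
        # Ratio of non-word, non-space characters as a crude noise proxy.
        symbols = len(re.findall(r"[^\w\s]", text))
        if symbols / max(len(text), 1) > max_symbol_ratio:
            continue  # likely markup debris or boilerplate
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of an earlier document
        seen.add(digest)
        kept.append(text)
    return kept
```

A run over four toy documents — one exact duplicate, one overly symbol-heavy — keeps only the two clean, distinct ones; the hash set makes exact deduplication a single pass over the corpus.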
📝 Abstract
Training state-of-the-art large language models requires vast amounts of clean and diverse textual data. However, building suitable multilingual datasets remains a challenge. In this work, we present HPLT v2, a collection of high-quality multilingual monolingual and parallel corpora. The monolingual portion of the data contains 8T tokens covering 193 languages, while the parallel data contains 380M sentence pairs covering 51 languages. We document the entire data pipeline and release the code to reproduce it. We provide extensive analysis of the quality and characteristics of our data. Finally, we evaluate the performance of language models and machine translation systems trained on HPLT v2, demonstrating its value.