HPLT 3.0: Very Large-Scale Multilingual Resources for LLM and MT. Mono- and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models

📅 2025-11-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the scarcity of high-quality, large-scale, richly annotated data for multilingual large language model and machine translation research, this work introduces an open-source framework for constructing very large multilingual text datasets. It supports almost 200 languages and scales to 30 trillion tokens, integrating end-to-end techniques including web page cleaning, noise-robust language identification, exact and fuzzy deduplication, PII detection, register label annotation, text quality scoring, and synthetic parallel corpus generation. The authors also propose a multilingual evaluation framework built around natively created tasks, releasing standardized benchmarks for nine European languages and an automated assessment pipeline. The resulting dataset is likely the largest publicly available multilingual pretraining corpus to date. Using it, the authors train a family of 57 monolingual encoder-decoder models as well as several GPT-style monolingual reference models. All artifacts, including the data processing pipelines, evaluation benchmarks, and pretrained model families, are fully open-sourced.
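The cleaning stages named above (noise-robust language identification, exact and fuzzy deduplication) compose in a standard way. The sketch below is a hypothetical minimal illustration, not the released HPLT tooling: it assumes fastText's public lid.176.bin language-ID model and the datasketch MinHash library, and the confidence cutoff, word 5-gram shingles, and 0.8 similarity threshold are illustrative choices.

```python
import hashlib

import fasttext                              # pip install fasttext
from datasketch import MinHash, MinHashLSH   # pip install datasketch

# Generic off-the-shelf LID model; HPLT's own noise-robust LID component
# is not reproduced here.
LID_MODEL = fasttext.load_model("lid.176.bin")

def keep_language(text: str, lang: str, min_conf: float = 0.7) -> bool:
    """Keep a document only if the top predicted language matches `lang`."""
    labels, probs = LID_MODEL.predict(text.replace("\n", " "), k=1)
    return labels[0] == f"__label__{lang}" and probs[0] >= min_conf

def signature(text: str, num_perm: int = 128) -> MinHash:
    """MinHash signature over word 5-gram shingles."""
    m = MinHash(num_perm=num_perm)
    words = text.split()
    for i in range(max(1, len(words) - 4)):
        m.update(" ".join(words[i:i + 5]).encode("utf-8"))
    return m

def clean(docs: list[str], lang: str) -> list[str]:
    """Language filtering, then exact dedup, then MinHash near-dedup."""
    seen = set()                                   # SHA-256 digests of kept docs
    lsh = MinHashLSH(threshold=0.8, num_perm=128)  # near-duplicate index
    kept = []
    for i, doc in enumerate(docs):
        if not keep_language(doc, lang):
            continue                               # wrong language or low confidence
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest in seen:
            continue                               # exact duplicate
        seen.add(digest)
        sig = signature(doc)
        if lsh.query(sig):
            continue                               # near-duplicate of a kept doc
        lsh.insert(str(i), sig)
        kept.append(doc)
    return kept
```

At web-archive scale these stages run distributed rather than in one loop; the point of the sketch is only the order of operations.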

📝 Abstract
We present an ongoing initiative to provide open, very large, high-quality, and richly annotated textual datasets for almost 200 languages. At 30 trillion tokens, this is likely the largest generally available multilingual collection of LLM pre-training data. These datasets are derived from web crawls from different sources and accompanied by a complete, open-source pipeline for document selection from web archives, text extraction from HTML, language identification for noisy texts, exact and near-deduplication, annotation with, among others, register labels, text quality estimates, and personally identifiable information; and final selection and filtering. We report on data quality probes through contrastive and analytical statistics, through manual inspection of samples for 24 languages, and through end-to-end evaluation of various language model architectures trained on this data. For multilingual LLM evaluation, we provide a comprehensive collection of benchmarks for nine European languages, with special emphasis on natively created tasks, mechanisms to mitigate prompt sensitivity, and refined normalization and aggregation of scores. Additionally, we train and evaluate a family of 57 monolingual encoder-decoder models, as well as a handful of monolingual GPT-like reference models. Besides the monolingual data and models, we also present a very large collection of parallel texts automatically mined from this data, together with a novel parallel corpus synthesized via machine translation.
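The abstract mentions "refined normalization and aggregation of scores" and prompt-sensitivity mitigation without spelling them out. One common scheme, sketched below purely as an assumption about what such normalization can look like (not necessarily the paper's exact formula), rescales accuracy so that random guessing maps to 0 and a perfect score to 1, and averages over prompt variants before normalizing:

```python
def normalized_score(acc: float, random_baseline: float) -> float:
    """Map the random-guessing baseline to 0.0 and a perfect score to 1.0."""
    return (acc - random_baseline) / (1.0 - random_baseline)

def aggregate(per_prompt_acc: list[float], random_baseline: float) -> float:
    """Average accuracy over prompt variants first, mitigating prompt
    sensitivity, then normalize against the task's chance level."""
    mean_acc = sum(per_prompt_acc) / len(per_prompt_acc)
    return normalized_score(mean_acc, random_baseline)

# A 4-way multiple-choice task (chance level 0.25) scored with three prompts:
print(aggregate([0.55, 0.49, 0.52], random_baseline=0.25))  # ≈ 0.36
```

Normalizing this way makes scores comparable across tasks with different chance levels before they are aggregated into a single figure.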
Problem

Research questions and friction points this paper is trying to address.

Creating massive open multilingual datasets for LLM training and machine translation
Developing comprehensive evaluation benchmarks for multilingual language model assessment
Providing high-quality monolingual and bilingual data with rich annotations and filtering (see the parallel-mining sketch below)
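For the bilingual side, parallel texts are typically mined by scoring cross-lingual sentence-embedding similarity. The sketch below is a hypothetical illustration of margin-based mining in the style of Artetxe and Schwenk (2019), not the paper's actual setup; the LaBSE encoder, the k=4 neighbourhood, and the 1.06 threshold are assumptions.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

model = SentenceTransformer("sentence-transformers/LaBSE")

def mine_pairs(src: list[str], tgt: list[str], k: int = 4, threshold: float = 1.06):
    """Return (src_idx, tgt_idx, score) for candidate translation pairs.
    Assumes both sides have at least k sentences."""
    x = model.encode(src, normalize_embeddings=True)   # (n, d) unit vectors
    y = model.encode(tgt, normalize_embeddings=True)   # (m, d) unit vectors
    sim = x @ y.T                                      # cosine similarities
    # Mean similarity to each side's k nearest neighbours (margin denominator).
    knn_x = np.sort(sim, axis=1)[:, -k:].mean(axis=1)  # per source sentence
    knn_y = np.sort(sim, axis=0)[-k:, :].mean(axis=0)  # per target sentence
    margin = sim / ((knn_x[:, None] + knn_y[None, :]) / 2)
    pairs = []
    for i in range(len(src)):
        j = int(margin[i].argmax())
        if margin[i, j] >= threshold:
            pairs.append((i, j, float(margin[i, j])))
    return pairs
```

The ratio margin rewards pairs that are much more similar to each other than to their respective neighbourhoods, which filters out generic near-matches.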
Innovation

Methods, ideas, or system contributions that make the work stand out.

Likely the largest openly available multilingual LLM pre-training collection: 30 trillion tokens covering almost 200 languages
Complete open-source pipeline for document selection, cleaning, deduplication, and annotation
Comprehensive multilingual evaluation benchmarks plus monolingual encoder-decoder and GPT-style reference models
Stephan Oepen
Professor in Language Technologies, Universitetet i Oslo
Human Language Technologies, Natural Language Processing, Computational Linguistics
Nikolay Arefyev
University of Oslo, Department of Informatics
Mikko Aulamo
Unknown affiliation
Marta Bañón
Prompsit Language Engineering
Maja Buljan
University of Oslo, Department of Informatics
Laurie Burchell
The Common Crawl Foundation
Lucas Charpentier
University of Oslo, Department of Informatics
Pinzhen Chen
University of Edinburgh
large language models, LLM post-training, machine translation, multilinguality
Mariya Fedorova
University of Oslo, Department of Informatics
Ona de Gibert
PhD Student @ University of Helsinki
Machine Translation, Multilinguality, Knowledge Distillation
Barry Haddow
University of Edinburgh
NLP, machine translation, spoken language translation, information extraction
Jan Hajič
Charles University, Prague, Institute of Formal and Applied Linguistics
Jindřich Helcl
University of Oslo, Department of Informatics
Andrey Kutuzov
University of Oslo
Computational Linguistics, Natural Language Processing, Diachronic Word Embeddings, Semantic Change Detection, Machine Learning
Zihao Li
University of Helsinki, Department of Digital Humanities
Risto Luukkonen
TurkuNLP, University of Turku, Department of Computing
Bhavitvya Malik
Research Assistant, University of Edinburgh
natural language processing, speech
Vladislav Mikhailov
University of Oslo
LLM, NLP, benchmarking
Amanda Myntti
TurkuNLP, University of Turku, Department of Computing
Dayyán O'Brien
University of Edinburgh
Natural language processing
Lucie Poláková
Charles University in Prague
Discourse Analysis, Dependency Syntax, Language Resources, Computational Linguistics
Sampo Pyysalo
University of Turku
Gema Ramírez-Sánchez
Prompsit Language Engineering
Janine Siewert
University of Helsinki, Department of Digital Humanities
Pavel Stepachev
University of Edinburgh, School of Informatics