Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models

📅 2025-05-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the scarcity of high-quality open-source data for multilingual large language model (LLM) pretraining, as well as the poor cross-lingual transferability and limited scalability of existing heuristic filtering methods, this paper proposes JQL, a framework that distills LLMs' quality-annotation capability into lightweight annotators built on pretrained multilingual embeddings, enabling data quality assessment even for languages and scripts unseen during training. Evaluated across 35 languages, JQL substantially outperforms heuristic filtering baselines such as FineWeb2, improving downstream model performance and data retention rates while markedly reducing computational overhead.

📝 Abstract
High-quality multilingual training data is essential for effectively pretraining large language models (LLMs). Yet, the availability of suitable open-source multilingual datasets remains limited. Existing state-of-the-art datasets mostly rely on heuristic filtering methods, restricting both their cross-lingual transferability and scalability. Here, we introduce JQL, a systematic approach that efficiently curates diverse and high-quality multilingual data at scale while significantly reducing computational demands. JQL distills LLMs' annotation capabilities into lightweight annotators based on pretrained multilingual embeddings. These models exhibit robust multilingual and cross-lingual performance, even for languages and scripts unseen during training. Evaluated empirically across 35 languages, the resulting annotation pipeline substantially outperforms current heuristic filtering methods like FineWeb2. JQL notably enhances downstream model training quality and increases data retention rates. Our research provides practical insights and valuable resources for multilingual data curation, raising the standards of multilingual dataset development.
Problem

Research questions and friction points this paper is trying to address.

Lack of high-quality multilingual training data for LLMs
Heuristic filtering methods limit cross-lingual transferability and scalability
Need efficient multilingual data curation with reduced computational demands
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses lightweight annotators based on multilingual embeddings (see the filtering sketch after this list)
Enhances multilingual data quality and retention rates
Outperforms heuristic filtering methods across languages
👥 Authors

Mehdi Ali
Fraunhofer IAIS, Lamarr Institute
Machine Learning · Knowledge Graphs · Relational Learning · NLP · Foundation Models

Manuel Brack
Applied Research Scientist @ Adobe | Adjunct Researcher @ hessian.AI
Machine Learning

Max Lübbering
Lamarr Institute, Fraunhofer IAIS

Elias Wendt
Computer Science Department, TU Darmstadt

Abbas Goher Khan
Lamarr Institute

Richard Rutmann
Lamarr Institute, Fraunhofer IAIS

Alex Jude
Fraunhofer IAIS

Maurice Kraus
TU Darmstadt

Alexander Arno Weber
Lamarr Institute, Fraunhofer IAIS

Felix Stollenwerk
AI Sweden

David Kaczér
Lamarr Institute

Florian Mai
Junior Research Group Leader, Uni Bonn
AI Alignment · LLM Reasoning · LLMs

Lucie Flek
University of Bonn, Lamarr Institute of Machine Learning and Artificial Intelligence
Natural Language Processing · Machine Learning · Physics · Computational Social Sciences

R. Sifa
Lamarr Institute, Fraunhofer IAIS

Nicolas Flores-Herr
Fraunhofer IAIS

Joachim Köhler
Lamarr Institute, Fraunhofer IAIS

P. Schramowski
DFKI SAINT, Hessian AI, Computer Science Department, TU Darmstadt

Michael Fromm
Fraunhofer IAIS
Machine Learning · Large Language Models · Argument Mining

K. Kersting
DFKI SAINT, Hessian AI, Computer Science Department, TU Darmstadt