🤖 AI Summary
To address the scarcity of high-quality open-source data for multilingual large language model (LLM) pretraining, as well as the poor cross-lingual transferability and limited scalability of existing heuristic filtering methods, this paper proposes JQL, a framework that distills LLMs' quality-annotation capability into lightweight annotators built on pretrained multilingual embeddings, enabling efficient data quality assessment even for languages and scripts unseen during training. Evaluated across 35 languages, JQL substantially outperforms heuristic baselines such as FineWeb2, improving downstream model performance, increasing data retention rates, and markedly reducing computational overhead.
📝 Abstract
High-quality multilingual training data is essential for effectively pretraining large language models (LLMs). Yet, the availability of suitable open-source multilingual datasets remains limited. Existing state-of-the-art datasets mostly rely on heuristic filtering methods, restricting both their cross-lingual transferability and scalability. Here, we introduce JQL, a systematic approach that efficiently curates diverse and high-quality multilingual data at scale while significantly reducing computational demands. JQL distills LLMs' annotation capabilities into lightweight annotators based on pretrained multilingual embeddings. These models exhibit robust multilingual and cross-lingual performance, even for languages and scripts unseen during training. Evaluated empirically across 35 languages, the resulting annotation pipeline substantially outperforms current heuristic filtering methods like FineWeb2. JQL notably enhances downstream model training quality and increases data retention rates. Our research provides practical insights and valuable resources for multilingual data curation, raising the standards of multilingual dataset development.
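The core idea described above, distilling an LLM's quality judgments into a lightweight annotator on top of frozen multilingual embeddings, can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual implementation: the embeddings are simulated with random vectors, the teacher scores with a synthetic linear target, and the "lightweight annotator" is a simple linear regression head trained by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: each document is represented by a frozen multilingual
# embedding (simulated here as a random vector), and a teacher LLM has
# assigned each document a scalar quality score (simulated as a noisy
# linear function of the embedding).
n_docs, dim = 1000, 64
X = rng.normal(size=(n_docs, dim))               # frozen document embeddings
w_teacher = rng.normal(size=dim)
y = X @ w_teacher + 0.1 * rng.normal(size=n_docs)  # teacher quality scores

# Distillation: fit a lightweight regression head to mimic the teacher,
# so new documents can be scored without calling the expensive LLM.
w = np.zeros(dim)
lr = 0.01
for _ in range(500):
    grad = X.T @ (X @ w - y) / n_docs  # mean-squared-error gradient
    w -= lr * grad

# The distilled head now scores unseen documents cheaply; a filtering
# pipeline might, for example, retain only above-threshold documents.
X_new = rng.normal(size=(5, dim))
scores = X_new @ w
keep = scores > np.median(X @ w)
```

Because the head runs on precomputed embeddings, scoring a corpus reduces to a matrix multiply, which is what makes this kind of distilled annotator far cheaper than querying the teacher LLM per document.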