🤖 AI Summary
Current multilingual large language model (LLM) pretraining suffers from a lack of principled, fine-grained filtering methods for non-English data, leading to suboptimal data quality and coverage. Method: We propose a transparent, lightweight, and scalable model-based multilingual data selection framework that combines Transformer-based semantic classifiers with FastText-based structural classifiers, coupled with a controllable cross-lingual sampling strategy. The framework covers 20 languages, including low-resource languages and multiple scripts, balancing linguistic diversity and knowledge density. Contribution/Results: We release high-quality, refined multilingual pretraining datasets. Experiments show that models trained on as little as 15% of the original token count match baseline MMLU performance, with further gains on XWinograd and other multilingual benchmarks. Generalizability is validated through ablations on the FineWeb-2 multilingual web crawl dataset.
📝 Abstract
Dataset curation has become a cornerstone of strong large language model (LLM) performance. While various rule-based filtering heuristics exist for English and multilingual datasets, model-based filtering techniques have primarily focused on English. To address the disparity stemming from limited research on non-English languages, we propose a model-based filtering framework for multilingual datasets that aims to identify a diverse set of structured and knowledge-rich samples. Our approach emphasizes transparency, simplicity, and efficiency, leveraging Transformer- and FastText-based classifiers to ensure the broad accessibility of our technique and data. We conduct comprehensive ablation studies on the FineWeb-2 web crawl dataset across diverse language families, scripts, and levels of resource availability to demonstrate the effectiveness of our method. When training a 1B-parameter Llama model for 70B and 119B tokens, our approach matches the baseline MMLU score with as little as 15% of the training tokens, while also improving performance across other benchmarks. These findings provide strong evidence for the generalizability of our approach to other languages. As a result, we extend our framework to 20 languages, for which we release the refined pretraining datasets.
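The abstract describes scoring documents with classifiers and then selecting a subset that balances languages rather than letting high-resource languages dominate. Below is a minimal sketch of such a per-language top-fraction selection step; the function name `select_top_fraction`, the dict-based record layout, and the toy scores are illustrative assumptions, not the paper's actual implementation.

```python
from collections import defaultdict

def select_top_fraction(docs, fraction=0.15):
    """Keep the highest-scoring fraction of documents within each
    language, so no single language can crowd out the others.

    `docs` is a list of records with "lang", "score", and "text" keys;
    "score" stands in for a classifier's quality prediction.
    """
    by_lang = defaultdict(list)
    for doc in docs:
        by_lang[doc["lang"]].append(doc)

    selected = []
    for group in by_lang.values():
        # Sort each language's documents by score, best first.
        group.sort(key=lambda d: d["score"], reverse=True)
        keep = max(1, round(len(group) * fraction))
        selected.extend(group[:keep])
    return selected

# Toy corpus: 10 German and 10 Swahili documents with synthetic scores.
corpus = (
    [{"lang": "de", "score": i / 10, "text": f"de-{i}"} for i in range(10)]
    + [{"lang": "sw", "score": i / 10, "text": f"sw-{i}"} for i in range(10)]
)
kept = select_top_fraction(corpus, fraction=0.2)
```

Applying a per-language quota like this, rather than a single global score threshold, is one simple way to realize the diversity-versus-quality trade-off the abstract emphasizes.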