🤖 AI Summary
Current multilingual large language model (LLM) pretraining suffers from a lack of principled, fine-grained filtering methods for non-English data, leading to suboptimal data quality and coverage. Method: We propose a transparent, lightweight, and scalable model-based multilingual data selection framework that combines Transformer-based semantic classifiers with FastText-based structural classifiers, coupled with a controllable cross-lingual sampling strategy. The framework covers 20 languages, including low-resource languages and multiple scripts, balancing linguistic diversity and knowledge density. Contribution/Results: We release high-quality, refined multilingual pretraining datasets. Experiments show that models trained on as little as 15% of the original token count match baseline MMLU performance, with further gains on XWinograd and other multilingual benchmarks. Generalizability is validated through ablations on the FineWeb-2 multilingual web crawl dataset.
📝 Abstract
Dataset curation has become a cornerstone of strong large language model (LLM) performance. While various rule-based filtering heuristics exist for English and multilingual datasets, model-based filtering techniques have primarily focused on English. To address the disparity stemming from limited research on non-English languages, we propose a model-based filtering framework for multilingual datasets that aims to identify a diverse set of structured and knowledge-rich samples. Our approach emphasizes transparency, simplicity, and efficiency, leveraging Transformer- and FastText-based classifiers to ensure the broad accessibility of our technique and data. We conduct comprehensive ablation studies on the FineWeb-2 web crawl dataset across diverse language families, scripts, and levels of resource availability to demonstrate the effectiveness of our method. When training a 1B-parameter Llama model for 70B and 119B tokens, our approach matches the baseline MMLU score with as little as 15% of the training tokens, while also improving performance across other benchmarks. These findings provide strong evidence for the generalizability of our approach to other languages. As a result, we extend our framework to 20 languages, for which we release the refined pretraining datasets.
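The abstract describes scoring documents with classifiers and then selecting a subset that balances languages rather than letting high-resource languages dominate. Below is a minimal sketch of such a per-language top-fraction selection step; the function name `select_top_fraction`, the dict-based record layout, and the toy scores are illustrative assumptions, not the paper's actual implementation.

```python
from collections import defaultdict

def select_top_fraction(docs, fraction=0.15):
    """Keep the highest-scoring fraction of documents within each
    language, so no single language can crowd out the others.

    `docs` is a list of records with "lang", "score", and "text" keys;
    "score" stands in for a classifier's quality prediction.
    """
    by_lang = defaultdict(list)
    for doc in docs:
        by_lang[doc["lang"]].append(doc)

    selected = []
    for group in by_lang.values():
        # Sort each language's documents by score, best first.
        group.sort(key=lambda d: d["score"], reverse=True)
        keep = max(1, round(len(group) * fraction))
        selected.extend(group[:keep])
    return selected

# Toy corpus: 10 German and 10 Swahili documents with synthetic scores.
corpus = (
    [{"lang": "de", "score": i / 10, "text": f"de-{i}"} for i in range(10)]
    + [{"lang": "sw", "score": i / 10, "text": f"sw-{i}"} for i in range(10)]
)
kept = select_top_fraction(corpus, fraction=0.2)
```

Applying a per-language quota like this, rather than a single global score threshold, is one simple way to realize the diversity-versus-quality trade-off the abstract emphasizes.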