Programming Every Example: Lifting Pre-training Data Quality like Experts at Scale

📅 2024-09-25
🏛️ arXiv.org
📈 Citations: 7
Influential: 0
📄 PDF
🤖 AI Summary
Traditional pre-training corpus curation relies on rigid, hand-crafted rules, which hinders sample-wise adaptive optimization. This paper proposes ProX, the first framework to model data refinement as a programmable task: it employs a lightweight small language model (0.3B parameters) to automatically generate and execute fine-grained string operations (e.g., normalization) per sample, enabling expert-level, fully automated data quality enhancement. ProX requires no domain-specific customization and supports heterogeneous corpora (e.g., C4, RedPajama, FineWeb), while also reducing training FLOPs. Experiments show that ProX-pretrained models achieve an average +2.0% improvement on downstream tasks; on OpenWebMath, they improve average accuracy by 7.6% (Mistral-7B) to 20.3% (CodeLlama-7B). Remarkably, ProX-refined models match the performance of a 200B-token baseline using only 10B tokens. The project releases >500B high-quality refined tokens, pretrained models, and the full implementation.

📝 Abstract
Large language model pre-training has traditionally relied on human experts to craft heuristics for improving corpus quality, resulting in numerous rules developed to date. However, these rules lack the flexibility to address the unique characteristics of individual examples effectively. Meanwhile, applying tailored rules to every example is impractical for human experts. In this paper, we demonstrate that even small language models, with as few as 0.3B parameters, can exhibit substantial data-refining capabilities comparable to those of human experts. We introduce Programming Every Example (ProX), a novel framework that treats data refinement as a programming task, enabling models to refine corpora by generating and executing fine-grained operations, such as string normalization, for each individual example at scale. Experimental results show that models pre-trained on ProX-curated data outperform those trained on either the original data or data filtered by other selection methods by more than 2% across various downstream benchmarks. Its effectiveness spans various model sizes and pre-training corpora, including C4, RedPajama-V2, FineWeb, FineWeb-Edu, and DCLM. Furthermore, ProX exhibits significant potential in domain-specific continual pre-training: without domain-specific design, models trained on OpenWebMath refined by ProX outperform human-crafted rule-based methods, improving average accuracy by 7.6% over Mistral-7B, 14.6% over Llama-2-7B, and 20.3% over CodeLlama-7B, all within 10B tokens, making them comparable to models like Llemma-7B trained on 200B tokens. Further analysis highlights that ProX significantly saves training FLOPs, offering a promising path for efficient LLM pre-training. We are open-sourcing ProX with a >500B-token corpus and models, and sharing all training and implementation details for reproducible research and future innovation. Code: https://github.com/GAIR-NLP/ProX
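To make the core idea concrete: ProX has a small model emit a short refinement "program" of string operations for each document, which is then executed deterministically. The sketch below is a minimal, hypothetical illustration of that execute-generated-operations pattern; the operation names, program format, and example document are assumptions for illustration, not the paper's actual API.

```python
import re

# Hypothetical operation library (illustrative; not ProX's real op set).
def normalize(doc: str) -> str:
    """Collapse runs of spaces/tabs and strip surrounding whitespace."""
    return re.sub(r"[ \t]+", " ", doc).strip()

def remove_lines(doc: str, pattern: str) -> str:
    """Drop lines matching a noise pattern (e.g., navigation boilerplate)."""
    return "\n".join(l for l in doc.splitlines() if not re.search(pattern, l))

OPS = {"normalize": normalize, "remove_lines": remove_lines}

def execute_program(doc: str, program: list[tuple]) -> str:
    """Apply each (op_name, *args) step emitted by the refining model, in order."""
    for op_name, *args in program:
        doc = OPS[op_name](doc, *args)
    return doc

raw = "Home | Login | Share\nThe    quick   brown fox.\n"
# In ProX, a 0.3B model would generate a program like this per document.
program = [("remove_lines", r"Home \| Login"), ("normalize",)]
print(execute_program(raw, program))  # → "The quick brown fox."
```

Separating generation (the small LM proposes operations) from execution (plain string functions apply them) is what makes the refinement both sample-adaptive and cheap to run at corpus scale.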
Problem

Research questions and friction points this paper is trying to address.

Enhances pre-training data quality efficiently
Automates tailored data refinement for individual examples
Reduces training FLOPs for LLM pre-training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Small models enhance data quality
ProX automates fine-grained data refinement
ProX boosts domain-specific training efficiency
Fan Zhou
Shanghai Jiao Tong University, Generative AI Research Lab (GAIR)
Zengzhi Wang
Shanghai Jiao Tong University
Data Engineering, Complex Reasoning, Large Language Models, Natural Language Processing
Qian Liu
Sea AI Lab, Shanghai Jiao Tong University, Generative AI Research Lab (GAIR)
Junlong Li
Shanghai Jiao Tong University, Generative AI Research Lab (GAIR)
Pengfei Liu
Shanghai Jiao Tong University, Shanghai Artificial Intelligence Laboratory, Generative AI Research Lab (GAIR)