🤖 AI Summary
This work addresses the challenge of designing effective data processing strategies for large-scale pretraining corpora, which span hundreds of heterogeneous categories and make manual curation costly and non-scalable. To this end, we propose DataEvolve, a novel framework that enables the automated evolution of pretraining data processing pipelines. DataEvolve employs a closed-loop iterative mechanism integrating problem identification, strategy generation, strategy execution and evaluation, and cross-generation optimization, supported by experience and strategy pools that accumulate knowledge across iterations. The framework incorporates domain-aware modules for data cleaning, format normalization, quality assessment, and strategy performance tracking. Evaluated on a 672B-token raw corpus, DataEvolve produces the 504B-token Darwin-CC dataset, yielding a 3B-parameter model that achieves an average score of 44.13 across 18 benchmarks—significantly outperforming existing methods, with notable gains on knowledge-intensive tasks such as MMLU.
📝 Abstract
Data Darwinism (Part I) established a ten-level hierarchy for data processing, showing that stronger processing can unlock greater data value. However, that work relied on manually designed strategies for a single category. Modern pretraining corpora comprise hundreds of heterogeneous categories spanning domains and content types, each demanding specialized treatment. At this scale, manual strategy design becomes prohibitively expensive. This raises a key question: can strategies evolve in an automated way? We introduce DataEvolve, a framework that enables strategies to evolve through iterative optimization rather than manual design. For each data category, DataEvolve operates in a closed evolutionary loop: it identifies quality issues, generates candidate strategies, executes them on sampled data, evaluates results, and refines approaches across generations. The process accumulates knowledge through an experience pool of discovered issues and a strategy pool tracking performance across iterations. Applied to 8 categories spanning 672B tokens from Nemotron-CC, DataEvolve produces Darwin-CC, a 504B-token dataset whose strategies were evolved over 30 iterations per category. With 3B models trained on 500B tokens, Darwin-CC outperforms raw data (+3.96 points) and achieves a 44.13 average score across 18 benchmarks, surpassing DCLM, Ultra-FineWeb, and FineWeb-Edu, with strong gains on knowledge-intensive tasks such as MMLU. Analysis shows evolved strategies converge on cleaning-focused approaches: targeted noise removal and format normalization with domain-aware preservation, echoing the L4 (Generative Refinement) principles from Part I. Ablation studies confirm that iterative evolution is essential: optimized strategies outperform suboptimal ones by 2.93 points, establishing evolutionary strategy design as both feasible and necessary for pretraining-scale data curation.
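The closed evolutionary loop described in the abstract (identify issues → generate a strategy → execute and evaluate on sampled data → refine across generations, backed by an experience pool and a strategy pool) can be sketched as a minimal toy loop. This is a hypothetical illustration, not the paper's actual implementation: the function names, the scalar `quality` field, and the threshold-based "strategy" are all stand-in assumptions.

```python
import random

def evolve_category(data_sample, iterations=30, seed=0):
    """Toy sketch of a per-category DataEvolve-style loop (hypothetical,
    not the paper's implementation): each generation flags quality issues,
    proposes a candidate cleaning strategy, scores it on the sample, and
    the best strategy across generations is kept."""
    rng = random.Random(seed)
    experience_pool = []  # discovered quality issues, accumulated over time
    strategy_pool = []    # (strategy, score) pairs across generations

    for generation in range(iterations):
        # 1. Problem identification: flag low-quality items in the sample.
        issues = [d["id"] for d in data_sample if d["quality"] < 0.5]
        experience_pool.extend(issues)

        # 2. Strategy generation: propose a candidate strategy
        #    (here simply a random cleaning threshold).
        candidate = {"threshold": rng.uniform(0.3, 0.7)}

        # 3. Execution + evaluation: apply the strategy to the sample
        #    and score the retained data by its average quality.
        cleaned = [d for d in data_sample
                   if d["quality"] >= candidate["threshold"]]
        score = sum(d["quality"] for d in cleaned) / max(len(cleaned), 1)
        strategy_pool.append((candidate, score))

    # 4. Cross-generation optimization: keep the best strategy found so far.
    best, best_score = max(strategy_pool, key=lambda s: s[1])
    return best, best_score, experience_pool

# Tiny synthetic "category" with quality scores in [0.0, 0.9].
sample = [{"id": i, "quality": i / 10} for i in range(10)]
best, score, pool = evolve_category(sample)
```

In the real framework the candidate strategies are full processing pipelines rather than a single threshold, and evaluation involves training-signal proxies rather than a direct quality average; the sketch only shows how the two pools feed an iterative select-and-refine loop.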