Data Darwinism Part II: DataEvolve -- AI Can Autonomously Evolve Pretraining Data Curation

📅 2026-03-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of designing effective data processing strategies for large-scale pretraining corpora, which span hundreds of heterogeneous categories and render manual curation costly and non-scalable. To this end, we propose DataEvolve, a novel framework that enables the automated evolution of pretraining data processing pipelines. DataEvolve employs a closed-loop iterative mechanism integrating problem identification, strategy generation, execution evaluation, and cross-generation optimization, supported by experience and strategy pools to accumulate knowledge across iterations. The framework incorporates domain-aware modules for data cleaning, format normalization, quality assessment, and strategy performance tracking. Evaluated on a 672B-token raw corpus, DataEvolve produces the 504B-token Darwin-CC dataset, yielding a 3B-parameter model that achieves an average score of 44.13 across 18 benchmarks—significantly outperforming existing methods, with notable gains on knowledge-intensive tasks such as MMLU.

📝 Abstract
Data Darwinism (Part I) established a ten-level hierarchy for data processing, showing that stronger processing can unlock greater data value. However, that work relied on manually designed strategies for a single category. Modern pretraining corpora comprise hundreds of heterogeneous categories spanning domains and content types, each demanding specialized treatment. At this scale, manual strategy design becomes prohibitive. This raises a key question: can strategies evolve in an automated way? We introduce DataEvolve, a framework that enables strategies to evolve through iterative optimization rather than manual design. For each data category, DataEvolve operates in a closed evolutionary loop: it identifies quality issues, generates candidate strategies, executes them on sampled data, evaluates results, and refines approaches across generations. The process accumulates knowledge through an experience pool of discovered issues and a strategy pool tracking performance across iterations. Applied to 8 categories spanning 672B tokens from Nemotron-CC, DataEvolve produces Darwin-CC, a 504B-token dataset with strategies evolved through 30 iterations per category. Training 3B models on 500B tokens, Darwin-CC outperforms raw data (+3.96 points) and achieves a 44.13 average score across 18 benchmarks, surpassing DCLM, Ultra-FineWeb, and FineWeb-Edu, with strong gains on knowledge-intensive tasks such as MMLU. Analysis shows evolved strategies converge on cleaning-focused approaches: targeted noise removal and format normalization with domain-aware preservation, echoing the L4 (Generative Refinement) principles from Part I. Ablation studies confirm iterative evolution is essential: optimized strategies outperform suboptimal ones by 2.93 points, establishing evolutionary strategy design as feasible and necessary for pretraining-scale data curation.
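The closed loop described in the abstract (identify issues → generate a strategy → execute on sampled data → evaluate → track performance across generations) can be sketched in a few lines. This is a minimal toy illustration, not the paper's implementation: the noise markers, heuristic functions, and pool structures below are all illustrative stand-ins for DataEvolve's LLM-driven modules.

```python
import random

# Toy noise markers standing in for real quality issues in a category.
NOISE_MARKERS = ["<ad>", "click here", "\u00a0"]

def identify_issues(sample):
    """Problem identification: flag which noise markers appear in the sample."""
    return {m for doc in sample for m in NOISE_MARKERS if m in doc}

def generate_strategy(issues):
    """Strategy generation: build a cleaning function targeting found issues."""
    def strategy(doc):
        for m in issues:
            doc = doc.replace(m, " ")      # targeted noise removal
        return " ".join(doc.split())       # format normalization
    return strategy

def evaluate_quality(docs):
    """Execution evaluation: fraction of docs free of any noise marker."""
    clean = sum(all(m not in d for m in NOISE_MARKERS) for d in docs)
    return clean / len(docs)

def evolve(category_data, n_iterations=5, sample_size=4, seed=0):
    """One category's evolutionary loop with experience and strategy pools."""
    rng = random.Random(seed)
    experience_pool = set()    # accumulates discovered issues across iterations
    strategy_pool = []         # (score, strategy): cross-generation tracking
    for _ in range(n_iterations):
        sample = rng.sample(category_data, min(sample_size, len(category_data)))
        experience_pool |= identify_issues(sample)
        strategy = generate_strategy(experience_pool)
        score = evaluate_quality([strategy(d) for d in sample])
        strategy_pool.append((score, strategy))
    # Cross-generation optimization: keep the best-scoring strategy.
    return max(strategy_pool, key=lambda p: p[0])[1]

corpus = ["good text", "click here now<ad>", "plain\u00a0doc", "fine article"]
best = evolve(corpus)
cleaned = [best(d) for d in corpus]
print(cleaned)
```

In the paper the evaluation and strategy generation are far richer (domain-aware cleaning, quality assessment, 30 iterations per category), but the control flow has this same shape: the winning strategy is then applied to the full category corpus.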
Problem

Research questions and friction points this paper is trying to address.

data curation
pretraining data
automated strategy evolution
heterogeneous data categories
data processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

DataEvolve
automated data curation
evolutionary optimization
pretraining data
strategy evolution
Tiantian Mi
SII, FDU, SJTU, GAIR
Dongming Shan
KPS
Zhen Huang
SII, FDU, SJTU, GAIR
Yiwei Qin
SII, SJTU, GAIR
Muhang Xie
SII, GAIR
Yuxuan Qiao
GAIR
Yixiu Liu
Master student at Shanghai Jiao Tong University
Chenyang Zhou
ASU
Pengfei Liu
SII, SJTU, GAIR