Data Darwinism Part I: Unlocking the Value of Scientific Data for Pre-training

📅 2026-02-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the learnability gap in scientific data during pre-training, which stems from the absence of a systematic, high-quality processing framework. To bridge this gap, the authors propose Data Darwinism, a ten-level taxonomy (L0–L9) for scientific data refinement, applying Generative Refinement (L4) and Cognitive Completion (L5) with frontier LLMs to enhance raw scientific text. They construct the 900B-token Darwin-Science corpus and, within a data-model co-evolution framework, pre-train the daVinci-origin-3B/7B models from scratch on data that excludes scientific content, establishing contamination-free baselines, before 600B tokens of continued pre-training. Evaluated across more than 20 benchmarks, the approach yields average gains of 2.12–2.95 points, with improvements of up to 8.40 points on domain-aligned tasks; systematic progression through L5 contributes a cumulative 1.36 points, validating that higher-level data curation unlocks latent data value.

📝 Abstract
Data quality determines foundation model performance, yet systematic processing frameworks are lacking. We introduce Data Darwinism, a ten-level taxonomy (L0-L9) that conceptualizes data-model co-evolution: advanced models produce superior data for next-generation systems. We validate this on scientific literature by constructing Darwin-Science, a 900B-token corpus (L0-L5). We identify a learnability gap in raw scientific text, which we bridge via L4 (Generative Refinement) and L5 (Cognitive Completion) using frontier LLMs to explicate reasoning and terminology. To ensure rigorous attribution, we pre-trained daVinci-origin-3B/7B models from scratch, excluding scientific content to create contamination-free baselines. After 600B tokens of continued pre-training, Darwin-Science outperforms baselines by +2.12 (3B) and +2.95 (7B) points across 20+ benchmarks, rising to +5.60 and +8.40 points on domain-aligned tasks. Systematic progression to L5 yields a +1.36 total gain, confirming that higher-level processing unlocks latent data value. We release the Darwin-Science corpus and daVinci-origin models to enable principled, co-evolutionary development.
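The leveled taxonomy described above can be pictured as a staged text-processing pipeline, where each level applies a progressively deeper transformation to raw scientific text. The sketch below is purely illustrative: the level names follow the paper (L0 raw, L4 Generative Refinement, L5 Cognitive Completion), but the transforms are stubs standing in for the frontier-LLM passes the authors actually use, and all function names are hypothetical.

```python
# Hypothetical sketch of a leveled refinement pipeline in the spirit of the
# L0-L9 Data Darwinism taxonomy. The real L4/L5 stages prompt frontier LLMs;
# here they are stubbed with trivial string operations for illustration.

from dataclasses import dataclass
from typing import Callable


@dataclass
class Level:
    index: int                        # position in the taxonomy (0 = raw data)
    name: str                         # human-readable stage name
    transform: Callable[[str], str]   # text -> refined text


def stub_generative_refinement(text: str) -> str:
    """Placeholder for an LLM rewrite pass (the paper's L4)."""
    return text.strip()  # a real pipeline would rewrite the text with an LLM


def stub_cognitive_completion(text: str) -> str:
    """Placeholder for explicating reasoning/terminology (the paper's L5)."""
    return text + "\n[reasoning and terminology explication would go here]"


PIPELINE = [
    Level(0, "raw", lambda t: t),
    Level(4, "generative_refinement", stub_generative_refinement),
    Level(5, "cognitive_completion", stub_cognitive_completion),
]


def refine(text: str, up_to_level: int) -> str:
    """Apply every stage whose index is <= up_to_level, in order."""
    for level in PIPELINE:
        if level.index <= up_to_level:
            text = level.transform(text)
    return text
```

In this framing, comparing corpora refined to different `up_to_level` values is what isolates the per-level contribution the paper reports (e.g. the cumulative gain from progressing to L5).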
Problem

Research questions and friction points this paper is trying to address.

scientific data, data quality, pre-training, learnability gap, foundation models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Data Darwinism, Generative Refinement, Cognitive Completion, Foundation Model Pre-training, Scientific Data Processing