DataDecide: How to Predict Best Pretraining Data with Small Experiments

📅 2025-04-15
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This study addresses the high computational cost of data selection in large language model (LLM) pretraining. We propose an efficient method to predict optimal pretraining data for large models (e.g., 1B parameters) using small-scale experiments (e.g., 150M-parameter models). Methodologically, we first systematically demonstrate that ranking data quality based on single-scale small-model performance achieves ~80% cross-scale predictive accuracy for large models; we further introduce a continuous likelihood metric, attaining >80% prediction capability at only 0.01% computational overhead. Our contributions include: (1) releasing DataDecide, the first open-source, systematic data-and-scale joint evaluation suite, covering 25 data sources, 100B tokens, 3 random seeds, and 5 zero-shot benchmarks; and (2) establishing single-scale prediction as the current cost-optimal baseline for data-centric LLM pretraining decisions.

๐Ÿ“ Abstract
Because large language models are expensive to pretrain on different datasets, using smaller-scale experiments to decide on data is crucial for reducing costs. Which benchmarks and methods of making decisions from observed performance at small scale most accurately predict the datasets that yield the best large models? To empower open exploration of this question, we release models, data, and evaluations in DataDecide -- the most extensive open suite of models over differences in data and scale. We conduct controlled pretraining experiments across 25 corpora with differing sources, deduplication, and filtering up to 100B tokens, model sizes up to 1B parameters, and 3 random seeds. We find that the ranking of models at a single, small size (e.g., 150M parameters) is a strong baseline for predicting best models at our larger target scale (1B) (~80% of comparisons correct). No scaling law methods among 8 baselines exceed the compute-decision frontier of single-scale predictions, but DataDecide can measure improvement in future scaling laws. We also identify that using continuous likelihood metrics as proxies in small experiments makes benchmarks including MMLU, ARC, HellaSwag, MBPP, and HumanEval >80% predictable at the target 1B scale with just 0.01% of the compute.
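The "~80% of comparisons correct" figure above is a pairwise decision accuracy: for every pair of corpora, check whether the small-scale ranking agrees with the large-scale ranking. A minimal sketch of that computation, using made-up corpus names and scores (purely illustrative, not numbers from the paper):

```python
from itertools import combinations

# Hypothetical benchmark scores per pretraining corpus at two scales
# (names and values are invented for illustration).
small_scores = {"c4": 0.41, "dolma": 0.45, "fineweb": 0.48, "pile": 0.39}
large_scores = {"c4": 0.52, "dolma": 0.55, "fineweb": 0.60, "pile": 0.53}

def decision_accuracy(small, large):
    """Fraction of corpus pairs where the small-scale ranking
    agrees with the large-scale ranking."""
    pairs = list(combinations(small, 2))
    correct = sum(
        (small[a] > small[b]) == (large[a] > large[b]) for a, b in pairs
    )
    return correct / len(pairs)

# Here 5 of the 6 pairs agree across scales: ~83% decision accuracy.
print(decision_accuracy(small_scores, large_scores))
```

Each corpus pair contributes one binary comparison, so the metric is robust to the absolute score gap between scales and only cares about ordering.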
Problem

Research questions and friction points this paper is trying to address.

Predict best pretraining data using small-scale experiments
Evaluate benchmarks for accurate large model performance prediction
Assess scaling laws and metrics for cost-effective model training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Small-scale experiments predict best pretraining data
Single small model ranking predicts larger model performance
Continuous likelihood metrics enable high predictability efficiently
๐Ÿ”Ž Similar Papers
No similar papers found.