Predicting Training Re-evaluation Curves Enables Effective Data Curriculums for LLMs

📅 2025-09-29
🤖 AI Summary
Data curriculum design in large language model (LLM) training lacks theoretical foundations. Method: We propose the Training Re-evaluation Curve (TREC), an analytical tool that retrospectively measures how well each training batch is retained by re-evaluating its data on the final model weights. We further introduce the first *prospective* TREC prediction method, which leverages the implicit exponential moving average (EMA) coefficient inherent in the AdamW optimizer to estimate the TREC's shape accurately before training, at no additional compute cost. Contribution/Results: We find that placing high-quality data at TREC troughs significantly improves model performance. Validated on models ranging from 111M to 3.9B parameters and a 900B-token pretraining corpus, this placement rule improves downstream task performance and provides a unified mechanistic explanation for diverse existing training strategies, including progressive learning, data mixing, and token-level scheduling, establishing a data-curriculum framework that is interpretable, predictable, and optimization-friendly.

📝 Abstract
Data curriculums have become central to successful LLM training, yet principles governing optimal data placement remain unclear. We introduce the *training re-evaluation curve (TREC)*, a diagnostic that retrospectively evaluates training batches *using the final model weights*. The TREC characterizes how well a trained model retains training data as a function of *when* the data was encountered during training. Analyzing TRECs for models from 111M to 3.9B parameters, we show that placing high-quality data at low points on the TREC significantly improves performance. Importantly, while a TREC is initially observable only after training, we demonstrate it can be *predicted in advance* from AdamW's implicit EMA coefficients, enabling proactive curriculum design. By predicting TRECs for published training recipes, we explain prior ablations and reveal suboptimal data placements. We also align high-quality data with TREC minima in order to improve continual pre-training of a 3.9B-parameter LLM trained on 900B tokens.
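The retrospective diagnostic described in the abstract can be sketched in a few lines: score every training batch with the *final* model, then read off the lowest points of the resulting curve as candidate positions for high-quality data. This is a minimal illustration, not the paper's implementation; `batch_loss_fn` is a hypothetical callable that evaluates one batch under the final model weights.

```python
def compute_trec(batch_loss_fn, training_batches):
    """Sketch of a training re-evaluation curve (TREC).

    batch_loss_fn: hypothetical callable scoring one batch with the
    final model weights (lower loss = better retention of that batch).
    Returns loss as a function of the batch's position in training.
    """
    return [batch_loss_fn(batch) for batch in training_batches]


def trec_troughs(trec, k):
    """Indices of the k lowest points on the curve: under the paper's
    placement rule, these are the steps where high-quality data
    should be scheduled."""
    return sorted(range(len(trec)), key=lambda i: trec[i])[:k]
```

On a toy curve, `trec_troughs([3.0, 1.0, 2.0, 0.5, 4.0], 2)` picks positions 3 and 1, i.e. the steps whose data the final model retains best.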
Problem

Research questions and friction points this paper is trying to address.

Predicting training re-evaluation curves to optimize data placement strategies
Determining optimal timing for high-quality data during LLM training
Enabling proactive curriculum design using AdamW's implicit EMA coefficients
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces training re-evaluation curve diagnostic method
Predicts TREC using AdamW's implicit EMA coefficients
Aligns high-quality data with TREC minima for optimization
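The prediction idea can be sketched from AdamW's decoupled weight decay: with learning rate lr[t] and decay coefficient wd, each step multiplies the existing weights by (1 - lr[t] * wd), so the final weights are an implicit EMA of per-step updates. The sketch below assumes the retention of step t's update is proportional to lr[t] * prod over j > t of (1 - lr[j] * wd); the paper's exact predictor may differ, and this is an illustration of the mechanism only.

```python
def predicted_trec(lrs, weight_decay):
    """Predict the TREC shape from AdamW's implicit EMA (assumed form).

    lrs: per-step learning-rate schedule.
    Returns, for each step t, the weight its update retains in the
    final parameters: lr[t] * prod_{j > t} (1 - lr[j] * weight_decay).
    Higher retention corresponds to a lower (better) point on the TREC.
    """
    retention = []
    decay_after = 1.0  # accumulated decay applied by all *later* steps
    for t in range(len(lrs) - 1, -1, -1):  # walk backwards through training
        retention.append(lrs[t] * decay_after)
        decay_after *= (1.0 - lrs[t] * weight_decay)
    retention.reverse()
    return retention
```

For a constant learning rate, retention rises monotonically toward the end of training (later updates are decayed fewer times), which matches the intuition that recently seen data is best retained; a decaying LR schedule reshapes the curve, which is what makes the TREC's troughs predictable from the recipe alone.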