🤖 AI Summary
This work addresses the challenge of accurately predicting pretraining loss for large language models across varying model scales, batch sizes, and training steps, particularly when the batch size changes during training and when compute budgets are extrapolated far beyond those observed. To this end, the authors propose a loss prediction model grounded in a noisy quadratic system, which explicitly models test loss as a function of model size $N$, batch size $B$, and number of weight updates $K$. The framework enables joint optimization of training configurations under compound constraints on time, memory, and compute. According to the authors, it is the first loss prediction model that handles changing batch sizes, and it outperforms Chinchilla's loss model when extrapolating to compute budgets up to 1000× larger than those observed. The recommended $(N, B, K)$ configurations are close to the empirically optimal ones.
📝 Abstract
We introduce a predictive model that estimates the pre-training loss of large models from model size (N), batch size (B) and number of weight updates (K). This is the first loss prediction model that can handle changing batch size. The model outperforms Chinchilla's loss model, a model of the test loss using the model size and number of tokens, in terms of projecting the loss at extrapolated compute budgets (up to 1000-fold). A natural use of the model is to find optimal N, B, K configurations under explicit and compound resource constraints such as time, memory and compute. In our experiments, the model-selected configurations are close to the ground-truth optimum. Our work advocates for loss prediction as a better alternative to heuristic-based laws, which are growing in complexity. The implementation is available at https://github.com/chuningxdy/Noisy-Quadratic-System.
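To make the configuration-selection use case concrete, the following is a minimal sketch of a constrained search over (N, B, K), not the authors' implementation. It assumes a stand-in predictor: the Chinchilla parametric loss with the constants reported by Hoffmann et al. (2022), in place of the noisy quadratic system, whose functional form is not given in the abstract. The names `predicted_loss` and `best_config`, the 6·N·D FLOPs approximation, the choice to measure batch size in tokens, and the use of a cap on K as a proxy for a time budget are all illustrative assumptions.

```python
from itertools import product

# Chinchilla-style parametric loss, used here ONLY as a stand-in predictor.
# Constants are the fitted values reported by Hoffmann et al. (2022); this is
# NOT the noisy-quadratic-system model described in the abstract.
E, A, B_COEF, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28


def predicted_loss(n_params, batch_size, n_updates):
    """Stand-in loss: depends on B and K only through the token count B * K."""
    tokens = batch_size * n_updates  # assumption: batch size is counted in tokens
    return E + A / n_params**ALPHA + B_COEF / tokens**BETA


def best_config(n_grid, b_grid, k_grid, max_flops, max_updates):
    """Exhaustively search the grid for the lowest predicted loss that
    satisfies both a compute budget and a cap on the number of updates."""
    best = None
    for n, b, k in product(n_grid, b_grid, k_grid):
        flops = 6 * n * b * k  # common 6 * N * D approximation of training FLOPs
        if flops > max_flops or k > max_updates:
            continue  # violates a resource constraint
        loss = predicted_loss(n, b, k)
        if best is None or loss < best[0]:
            best = (loss, n, b, k)
    return best  # (loss, N, B, K), or None if no configuration is feasible


if __name__ == "__main__":
    result = best_config(
        n_grid=[10**8, 10**9],
        b_grid=[10**5, 10**6],
        k_grid=[10**3, 10**4],
        max_flops=6 * 10**18,
        max_updates=10**4,
    )
    print(result)
```

Because the stand-in predictor sees B and K only through their product, it cannot distinguish a large batch with few updates from a small batch with many; a predictor that takes N, B and K as separate inputs, as the abstract describes, could be substituted for `predicted_loss` without changing the search loop.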