🤖 AI Summary
It remains unclear whether training loss curves of large language models (LLMs) collapse onto a single universal trajectory under practical, data-budget-constrained scaling, in which width, depth, learning rate, batch size, and weight decay are varied simultaneously.
Method: The authors empirically investigate loss curve collapse across diverse LLM configurations trained under fixed compute budgets. They propose two novel techniques grounded in collapse behavior: (i) an early anomaly detection mechanism based on deviations from the collapsed trajectory, and (ii) a principled early-stopping criterion for hyperparameter search leveraging collapse consistency.
Contribution/Results: Under optimal hyperparameters, loss curves exhibit strong collapse, a robust indicator of compute-efficient training. The phenomenon generalizes to real training at scale, as validated on the Celerity model family, which was built using empirical scaling laws. The proposed methods accelerate hyperparameter optimization severalfold, improving training predictability, diagnostic capability, and resource efficiency.
📝 Abstract
Effective LLM training relies on *consistency*, meaning that key quantities -- such as final losses and optimal hyperparameters -- scale predictably across model sizes. Qiu et al. (2025) recently showed that this consistency extends beyond scalars: whole training loss curves can *collapse* onto a universal trajectory after a simple normalization. What remains unclear is whether this phenomenon holds for LLM families trained under *practical scaling recipes*, where width, depth, learning rate, batch size, and weight decay are scaled jointly. We show that it does: loss curves collapse across scales precisely when optimization hyperparameters are set optimally for the given data budget, in accordance with recent empirical scaling laws. Collapse thus emerges as a signature of compute-efficient training. We demonstrate two applications at scale: (1) deviation-from-collapse provides a sensitive, early diagnostic of training pathologies, and (2) the predictability of collapsed curves enables early stopping in large-scale hyperparameter tuning. Finally, we train a competitive LLM family, *Celerity*, using these insights, highlighting collapse as an effective tool for developing efficient LLMs.
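As a rough illustration of the deviation-from-collapse diagnostic described above, the sketch below normalizes a set of loss curves onto a shared axis and measures their pointwise spread. The specific normalization used here (rescaling steps by total steps and losses by the final loss) is an assumption for illustration only; the paper's actual transform may differ.

```python
import numpy as np

def normalize_curve(steps, losses):
    """Rescale a loss curve onto a common [0, 1] axis.

    Hypothetical normalization: steps are divided by the total step
    count and losses by the final loss. The paper's exact transform
    may differ; this is an illustrative stand-in.
    """
    t = np.asarray(steps, dtype=float) / steps[-1]
    l = np.asarray(losses, dtype=float) / losses[-1]
    return t, l

def collapse_deviation(curves, grid_size=100):
    """Maximum pointwise spread of normalized curves on a shared grid.

    A value near zero indicates collapse onto a universal trajectory;
    a value that grows early in training would flag an anomalous run.
    `curves` is a list of (steps, losses) pairs.
    """
    grid = np.linspace(0.0, 1.0, grid_size)
    resampled = []
    for steps, losses in curves:
        t, l = normalize_curve(steps, losses)
        # Resample each normalized curve onto the shared grid so
        # curves of different lengths can be compared pointwise.
        resampled.append(np.interp(grid, t, l))
    stacked = np.stack(resampled)
    return float(np.max(stacked.max(axis=0) - stacked.min(axis=0)))
```

In this toy form, the early-stopping application amounts to monitoring `collapse_deviation` for a candidate run against reference curves and abandoning the run once the deviation exceeds a threshold.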