🤖 AI Summary
It remains unclear whether training loss curves of large language models (LLMs) collapse onto a single universal trajectory under practical, data-budget-constrained scaling, in which width, depth, learning rate, batch size, and weight decay are varied simultaneously.
Method: The authors empirically investigate loss curve collapse across diverse LLM configurations trained under fixed compute budgets. They propose two novel techniques grounded in collapse behavior: (i) an early anomaly detection mechanism based on deviations from the collapsed trajectory, and (ii) a principled early-stopping criterion for hyperparameter search leveraging collapse consistency.
Contribution/Results: Under optimal hyperparameters, loss curves exhibit strong collapse, a robust indicator of compute-efficient training. The phenomenon generalizes to real training at scale, as validated on the Celerity model family, which was built using empirical scaling laws. The proposed methods accelerate hyperparameter optimization severalfold, improving training predictability, diagnostic capability, and resource efficiency.
📝 Abstract
Effective LLM training relies on *consistency*, meaning that key quantities -- such as final losses and optimal hyperparameters -- scale predictably across model sizes. Qiu et al. (2025) recently showed that this consistency extends beyond scalars: whole training loss curves can *collapse* onto a universal trajectory after a simple normalization. What remains unclear is whether this phenomenon holds for LLM families trained under *practical scaling recipes*, where width, depth, learning rate, batch size, and weight decay are scaled jointly. We show that it does: loss curves collapse across scales precisely when optimization hyperparameters are set optimally for the given data budget, in accordance with recent empirical scaling laws. Collapse thus emerges as a signature of compute-efficient training. We demonstrate two applications at scale: (1) deviation-from-collapse provides a sensitive, early diagnostic of training pathologies, and (2) the predictability of collapsed curves enables early stopping in large-scale hyperparameter tuning. Finally, we train a competitive LLM family, *Celerity*, using these insights, highlighting collapse as an effective tool for developing efficient LLMs.
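As a rough illustration of the deviation-from-collapse diagnostic described above, the sketch below normalizes a set of loss curves onto a shared axis and measures their pointwise spread. The specific normalization used here (rescaling steps by total steps and losses by the final loss) is an assumption for illustration only; the paper's actual transform may differ.

```python
import numpy as np

def normalize_curve(steps, losses):
    """Rescale a loss curve onto a common [0, 1] axis.

    Hypothetical normalization: steps are divided by the total step
    count and losses by the final loss. The paper's exact transform
    may differ; this is an illustrative stand-in.
    """
    t = np.asarray(steps, dtype=float) / steps[-1]
    l = np.asarray(losses, dtype=float) / losses[-1]
    return t, l

def collapse_deviation(curves, grid_size=100):
    """Maximum pointwise spread of normalized curves on a shared grid.

    A value near zero indicates collapse onto a universal trajectory;
    a value that grows early in training would flag an anomalous run.
    `curves` is a list of (steps, losses) pairs.
    """
    grid = np.linspace(0.0, 1.0, grid_size)
    resampled = []
    for steps, losses in curves:
        t, l = normalize_curve(steps, losses)
        # Resample each normalized curve onto the shared grid so
        # curves of different lengths can be compared pointwise.
        resampled.append(np.interp(grid, t, l))
    stacked = np.stack(resampled)
    return float(np.max(stacked.max(axis=0) - stacked.min(axis=0)))
```

In this toy form, the early-stopping application amounts to monitoring `collapse_deviation` for a candidate run against reference curves and abandoning the run once the deviation exceeds a threshold.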