🤖 AI Summary
This work investigates the empirically observed 1/3 power-law scaling of loss convergence during large language model training, a phenomenon whose underlying mechanism has remained unclear. Through theoretical analysis and extensive empirical validation, we demonstrate that this slow convergence arises from an inherent optimization bottleneck in the softmax cross-entropy objective when learning highly peaked distributions, such as those encountered in next-token prediction, where both the loss and its gradients naturally exhibit power-law decay. By combining simplified analytical models, large-scale language model experiments, and dynamic analyses of loss-gradient trajectories, we provide the first mechanistic explanation for the ubiquitous 1/3 power-law time scaling, showing that it is determined by the interplay between the target distribution’s structure and the model architecture rather than by specific data or implementation details. Our findings are consistently validated across diverse models and tasks, offering new insights into neural scaling laws and pathways to improved training efficiency.
📝 Abstract
Training large language models (LLMs) is computationally expensive, partly because the loss exhibits slow power-law convergence whose origin remains debated. Through systematic analysis of toy models and empirical evaluation of LLMs, we show that this behavior can arise intrinsically from the use of softmax and cross-entropy. When learning peaked probability distributions, e.g., next-token distributions, these components yield power-law vanishing losses and gradients, creating a fundamental optimization bottleneck. This ultimately leads to power-law time scaling of the loss with a universal exponent of $1/3$. Our results provide a mechanistic explanation for observed neural scaling and suggest new directions for improving LLM training efficiency.
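The claimed bottleneck can be illustrated with a minimal toy sketch (not the paper's actual setup): plain gradient descent on a softmax cross-entropy loss toward a one-hot, i.e. maximally peaked, target. Because the gradient vanishes together with the loss, the loss decays as a power law in time rather than exponentially. Note that the exponent measured in this simplified single-softmax toy need not equal the $1/3$ the paper derives for its full setting; all parameter choices below (vocabulary size, learning rate, step count) are illustrative assumptions.

```python
import numpy as np

# Toy sketch: gradient descent on softmax cross-entropy with a one-hot
# (maximally peaked) target. The gradient with respect to the logits is
# p - onehot(target), which itself vanishes as the loss does, so the
# loss decays as a power law in time instead of exponentially.

rng = np.random.default_rng(0)
K = 100                        # vocabulary size (illustrative choice)
z = rng.normal(size=K)         # logits
target = 0                     # index of the one-hot target class
lr = 1.0                       # learning rate (illustrative choice)
steps = 10_000

def softmax(z):
    e = np.exp(z - z.max())    # shift for numerical stability
    return e / e.sum()

losses = []
for _ in range(steps):
    p = softmax(z)
    losses.append(-np.log(p[target]))   # cross-entropy vs one-hot target
    grad = p.copy()
    grad[target] -= 1.0                 # d(CE)/dz = p - onehot(target)
    z -= lr * grad

# Fit the slope of log(loss) vs log(step) over the late-time tail;
# a roughly constant negative slope indicates power-law decay.
ts = np.arange(1, steps + 1)
tail = slice(steps // 10, None)
slope = np.polyfit(np.log(ts[tail]), np.log(losses[tail]), 1)[0]
print(f"final loss: {losses[-1]:.2e}, log-log slope: {slope:.2f}")
```

Replacing the power-law fit with a semi-log fit (log-loss vs. linear time) gives a visibly poor fit in this toy, which is one quick way to distinguish the power-law regime from ordinary exponential convergence.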