🤖 AI Summary
Directly measuring Hessian sharpness, the largest eigenvalue of the loss Hessian, is computationally prohibitive for large language models (LLMs). This work analyzes "critical sharpness" ($\lambda_c$), a scalable measure that needs fewer than ten forward passes given the update direction, and introduces "relative critical sharpness" ($\lambda_c^{1\to 2}$), which quantifies the curvature of one loss landscape while optimizing another, to analyze the transition from pre-training to fine-tuning and guide data mixing strategies. Using critical sharpness, the authors provide the first demonstration of progressive sharpening and Edge of Stability at scale, in models of up to 7B parameters, spanning both pre-training and mid-training of OLMo-2 models. The result is a practical tool for diagnosing curvature dynamics and informing data composition choices in large-scale LLM training.
📝 Abstract
Understanding the curvature evolution of the loss landscape is fundamental to analyzing the training dynamics of neural networks. The most commonly studied measure, Hessian sharpness ($\lambda_{\max}^H$) -- the largest eigenvalue of the loss Hessian -- determines local training stability and interacts with the learning rate throughout training. Despite its significance in analyzing training dynamics, direct measurement of Hessian sharpness remains prohibitive for Large Language Models (LLMs) due to high computational cost. We analyze $\textit{critical sharpness}$ ($\lambda_c$), a computationally efficient measure requiring fewer than $10$ forward passes given the update direction $\Delta \mathbf{\theta}$. Critically, this measure captures well-documented Hessian sharpness phenomena, including progressive sharpening and Edge of Stability. Using this measure, we provide the first demonstration of these sharpness phenomena at scale, up to $7$B parameters, spanning both pre-training and mid-training of OLMo-2 models. We further introduce $\textit{relative critical sharpness}$ ($\lambda_c^{1\to 2}$), which quantifies the curvature of one loss landscape while optimizing another, to analyze the transition from pre-training to fine-tuning and guide data mixing strategies. Critical sharpness provides practitioners with a practical tool for diagnosing curvature dynamics and informing data composition choices at scale. More broadly, our work shows that scalable curvature measures can provide actionable insights for large-scale training.
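The abstract does not spell out how $\lambda_c$ is computed, so the following is only a minimal sketch under a stated assumption: that the critical step size $\eta_c$ is the smallest $\eta > 0$ at which the loss along the update direction exceeds its starting value, $L(\theta - \eta\,\Delta\theta) > L(\theta)$, and that $\lambda_c = 2/\eta_c$. The function name `critical_sharpness`, the bracketing-plus-bisection search, and the toy quadratic loss are all illustrative choices, not the authors' implementation. For a quadratic loss with $\Delta\theta$ equal to the gradient $g$, this quantity reduces to the directional curvature $g^\top H g / g^\top g$, which the example uses as a sanity check.

```python
import numpy as np

def critical_sharpness(loss_fn, theta, delta, eta0=1.0, max_evals=9):
    """Estimate 2 / eta_c using at most `max_evals` loss evaluations.

    Assumed definition (not taken from the abstract): eta_c is the smallest
    step size eta for which loss_fn(theta - eta * delta) > loss_fn(theta).
    The search also assumes the loss stays above its starting value for
    every eta beyond eta_c along this direction.
    """
    base = loss_fn(theta)  # forward pass at the current parameters
    evals = 1

    def increases(eta):
        return loss_fn(theta - eta * delta) > base

    lo, hi = 0.0, np.inf  # lo: loss not increased; hi: loss increased
    eta = eta0
    # Phase 1: halve or double the step until eta_c is bracketed.
    while evals < max_evals and (lo == 0.0 or np.isinf(hi)):
        evals += 1
        if increases(eta):
            hi = eta
            eta /= 2.0
        else:
            lo = eta
            eta *= 2.0
    # Phase 2: bisect the bracket with the remaining budget.
    while evals < max_evals and lo > 0.0 and np.isfinite(hi):
        evals += 1
        mid = 0.5 * (lo + hi)
        if increases(mid):
            hi = mid
        else:
            lo = mid

    if np.isinf(hi):      # loss never increased: result is an upper bound
        eta_c = lo
    elif lo == 0.0:       # loss always increased: result is a lower bound
        eta_c = hi
    else:
        eta_c = 0.5 * (lo + hi)
    return 2.0 / eta_c

# Toy check on a quadratic loss L(theta) = 0.5 * theta^T H theta.
H = np.diag([1.0, 4.0])
theta = np.array([1.0, 1.0])
grad = H @ theta                       # update direction: plain gradient

def quadratic_loss(t):
    return 0.5 * t @ H @ t

estimate = critical_sharpness(quadratic_loss, theta, grad)
directional = (grad @ H @ grad) / (grad @ grad)
print(f"estimate={estimate:.3f}  g^T H g / g^T g={directional:.3f}")
```

Only loss evaluations are used, so no Hessian-vector products or extra backward passes are needed once the update direction is available. With a budget this small the estimate is coarse, and its accuracy depends on how close the initial guess `eta0` is to $\eta_c$.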