🤖 AI Summary
This work addresses a limitation of existing scaling laws, which typically fix the optimizer (e.g., AdamW) and thus fail to capture the coupling between novel optimizers and model or data scale. The authors propose a unified cross-optimizer scaling law that shares common power-law exponents while incorporating optimizer-specific scaling factors, combining empirical fitting with theoretical analysis on convex quadratic objectives. This formulation shows that Chinchilla-style scaling arises from decomposing the loss into irreducible, approximation, and optimization errors, and demonstrates that independently fitted scaling parameters are highly correlated and ill-conditioned. The new framework substantially improves prediction stability and enables fair performance comparisons across advanced optimizers such as Muon, Shampoo, and SOAP.
📝 Abstract
The quality of Large Language Model (LLM) pretraining depends on multiple factors, including the compute budget and the choice of optimization algorithm. Empirical scaling laws are widely used to predict loss as model size and training data grow; however, almost all existing studies fix the optimizer (typically AdamW). At the same time, a new generation of optimizers (e.g., Muon, Shampoo, SOAP) promises faster and more stable convergence, but their relationship with model and data scaling is not yet well understood. In this work, we study scaling laws across different optimizers. Empirically, we show that 1) separate Chinchilla-style scaling laws fitted for each optimizer are ill-conditioned and have highly correlated parameters. Instead, 2) we propose a more robust law with shared power-law exponents and optimizer-specific rescaling factors, which enables direct comparison between optimizers. Finally, 3) we provide a theoretical analysis of gradient-based methods for the proxy task of a convex quadratic objective, demonstrating that Chinchilla-style scaling laws emerge naturally as a result of loss decomposition into irreducible, approximation, and optimization errors.
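To make the "shared exponents, optimizer-specific rescaling factors" idea concrete, here is a minimal sketch of how such a cross-optimizer law could be fitted. It assumes a Chinchilla-style form L(N, D) = E + A/N^α + B/D^β in which all optimizers share (E, α, β) but each optimizer gets its own prefactors (A, B). All names, parameter values, and the synthetic data are invented for illustration; this is not the paper's actual fitting code or dataset.

```python
import numpy as np
from scipy.optimize import curve_fit

def make_loss(n_opts):
    """Build a loss model with shared exponents and per-optimizer prefactors.

    Parameter layout: [E, alpha, beta, A_1..A_k, B_1..B_k] for k optimizers.
    """
    def loss(X, E, alpha, beta, *factors):
        N, D, opt = X                     # model size, tokens, optimizer index
        opt = opt.astype(int)
        A = np.array(factors[:n_opts])[opt]   # optimizer-specific A factor
        B = np.array(factors[n_opts:])[opt]   # optimizer-specific B factor
        return E + A / N**alpha + B / D**beta
    return loss

# Synthetic (noise-free) data for two hypothetical optimizers that truly
# share the exponents alpha=0.34, beta=0.28 but differ in prefactors.
rng = np.random.default_rng(0)
N = rng.uniform(1e7, 1e9, 200)
D = rng.uniform(1e9, 1e11, 200)
opt = rng.integers(0, 2, 200).astype(float)
A_true = np.array([400.0, 300.0])
B_true = np.array([410.0, 380.0])
y = 1.69 + A_true[opt.astype(int)] / N**0.34 + B_true[opt.astype(int)] / D**0.28

# One joint fit across both optimizers instead of two independent fits.
p0 = [1.0, 0.3, 0.3, 100.0, 100.0, 100.0, 100.0]
popt, _ = curve_fit(make_loss(2), (N, D, opt), y, p0=p0, maxfev=20000)
print(popt[:3])  # recovered shared (E, alpha, beta)
```

Pooling data from all optimizers into one fit is what makes the comparison direct: with (E, α, β) shared, the per-optimizer factors (A, B) become the only degrees of freedom that distinguish optimizers, avoiding the correlated, ill-conditioned parameters that arise when each law is fitted separately.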