🤖 AI Summary
This paper addresses the fundamental paradigm shift in machine learning brought on by the large language model (LLM) era: the primary objective transitions from minimizing generalization error to reducing approximation error, and the dominant strategy shifts from regularization to model scaling. Motivated by two critical questions—(1) what new principles guide effective model scaling, and (2) how to reliably compare models when only a single large-scale experiment is feasible—the authors conduct extensive empirical analysis, scaling law modeling, and observation of training dynamics. They systematically evaluate classical implicit regularization techniques (e.g., L2 regularization, small batch sizes, large learning rates) and find them broadly ineffective in the LLM regime. Crucially, they identify the "scaling law crossover" phenomenon, which challenges the assumption that optimization methods transfer across scales. Based on these findings, the paper establishes two foundational problems for the LLM era—scaling-oriented design and reliable model comparison under single-run constraints—providing both theoretical grounding and a practical framework for algorithm design and evaluation.
📝 Abstract
The remarkable success of large language model pretraining and the discovery of scaling laws signify a paradigm shift in machine learning. Notably, the primary objective has evolved from minimizing generalization error to reducing approximation error, and the most effective strategy has transitioned from regularization (in a broad sense) to scaling up models. This raises a critical question: Do the established principles that proved successful in the generalization-centric era remain valid in this new era of scaling? This paper examines several influential regularization-based principles that may no longer hold in the scaling-centric, large language model (LLM) era, including explicit L2 regularization and implicit regularization through small batch sizes and large learning rates. Additionally, we identify a new phenomenon termed "scaling law crossover," where two scaling curves intersect at a certain scale, implying that methods effective at smaller scales may fail to generalize to larger ones. Together, these observations highlight two fundamental questions within this new paradigm:

- Guiding principles for scaling: If regularization is no longer the primary guiding principle for model design, what new principles are emerging to guide scaling?
- Model comparison at scale: How can we reliably and effectively compare models at a scale where only a single experiment is feasible?
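To make the "scaling law crossover" concrete, here is a minimal sketch: two methods whose losses follow saturating power laws in scale, where the method that wins at small scale loses at large scale. The functional form `L(N) = a * N**(-b) + c` is the commonly used saturating power law; all coefficients below are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Hypothetical scaling-law fits of the form L(N) = a * N**(-b) + c.
# All coefficients are made up for illustration only.
def loss_method_A(n):
    # Shallower exponent, lower offset advantage early on
    return 2.0 * n ** -0.30 + 1.8

def loss_method_B(n):
    # Worse at small scale, but a steeper exponent and lower
    # irreducible loss, so it eventually overtakes method A.
    return 12.0 * n ** -0.45 + 1.7

# Scan a grid of model scales and locate the crossover: the first
# scale at which method B's predicted loss drops below method A's.
scales = np.logspace(3, 12, 2000)        # e.g. parameter counts 1e3..1e12
diff = loss_method_A(scales) - loss_method_B(scales)
crossover_idx = int(np.argmax(diff > 0))  # first index where B beats A
print(f"crossover near N ~ {scales[crossover_idx]:.2e}")
```

The point of the sketch: any comparison run entirely to the left of the crossover would pick method A, yet method B is the better choice at frontier scale, which is exactly why small-scale rankings can mislead.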