Predictable Scale: Part I -- Optimal Hyperparameter Scaling Law in Large Language Model Pretraining

📅 2025-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the problem of determining optimal learning rate and batch size scaling laws for large language model (LLM) pretraining as a function of model parameter count and dataset size. Methodologically, we conduct an extensive empirical investigation—training 3,700 dense and Mixture-of-Experts (MoE) models across diverse configurations using over one million H800 GPU-hours—combined with large-scale grid search, convex optimization modeling, and power-law fitting. Our key contribution is the first unified theoretical framework for optimal hyperparameter scaling, applicable across architectures (dense/MoE), data distributions, and sparsity levels. We identify convexity in the hyperparameter optimization landscape and define an “optimal plateau region”; further, we propose a plug-and-play predictive tool achieving performance within 0.07% of the global optimum. The derived scaling laws demonstrate high robustness to variations in model shape, sparsity, and data distribution. All training loss curves and checkpoints are publicly released.
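The summary states the two functional forms at the heart of the paper: optimal learning rate follows a power law in both parameter count N and data size D, while optimal batch size scales primarily with D. A minimal sketch of that parameterization is below; the coefficients and exponents are illustrative placeholders, not the paper's fitted values.

```python
def optimal_learning_rate(n_params: float, n_tokens: float,
                          c: float = 1.0, alpha: float = -0.7,
                          beta: float = 0.3) -> float:
    """Optimal LR as a power law in model size N and data size D:
    eta* = c * N**alpha * D**beta  (placeholder constants)."""
    return c * (n_params ** alpha) * (n_tokens ** beta)


def optimal_batch_size(n_tokens: float, c: float = 1.0,
                       gamma: float = 0.55) -> float:
    """Optimal batch size as a power law in data size D only:
    B* = c * D**gamma  (placeholder constants)."""
    return c * (n_tokens ** gamma)
```

With a negative exponent on N and positive exponents on D, the sketch reproduces the qualitative behavior the summary describes: larger models want smaller learning rates, and more training tokens push both the optimal learning rate and the optimal batch size upward.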

📝 Abstract
The impressive capabilities of Large Language Models (LLMs) across diverse tasks are now well-established, yet their effective deployment necessitates careful hyperparameter optimization. Through extensive empirical studies involving grid searches across diverse configurations, we discover universal scaling laws governing these hyperparameters: optimal learning rate follows a power-law relationship with both model parameters and data sizes, while optimal batch size scales primarily with data sizes. Our analysis reveals a convex optimization landscape for hyperparameters under fixed model and data size conditions. This convexity implies an optimal hyperparameter plateau. We contribute a universal, plug-and-play optimal hyperparameter tool for the community. Its estimated values on the test set are merely 0.07% away from the globally optimal LLM performance found via an exhaustive search. These laws demonstrate remarkable robustness across variations in model sparsity, training data distribution, and model shape. To the best of our knowledge, this is the first work that unifies different model shapes and structures, such as Mixture-of-Experts models and dense transformers, and that establishes optimal hyperparameter scaling laws across diverse data distributions. This exhaustive optimization process demands substantial computational resources, utilizing nearly one million NVIDIA H800 GPU hours to train 3,700 LLMs of varying sizes and hyperparameters from scratch and consuming approximately 100 trillion tokens in total. To facilitate reproducibility and further research, we will progressively release all loss measurements and model checkpoints through our designated repository https://step-law.github.io/
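The summary lists power-law fitting over large-scale grid-search results as part of the methodology. A common way to recover a power-law exponent from such measurements is ordinary least squares in log-log space; the sketch below illustrates this on synthetic data (the function name and data are illustrative, not taken from the paper).

```python
import numpy as np


def fit_power_law(x, y):
    """Fit y = c * x**k by linear regression on (log x, log y).

    Returns (c, k): the coefficient and exponent of the fitted law.
    """
    logx = np.log(np.asarray(x, dtype=float))
    logy = np.log(np.asarray(y, dtype=float))
    k, logc = np.polyfit(logx, logy, 1)  # slope = exponent, intercept = log c
    return float(np.exp(logc)), float(k)


# Synthetic example: data generated from y = 2 * x**0.5
x = [1e6, 1e7, 1e8, 1e9]
y = [2.0 * v ** 0.5 for v in x]
c, k = fit_power_law(x, y)  # recovers c ≈ 2.0, k ≈ 0.5
```

In practice, each (x, y) pair would be a model/data scale paired with the hyperparameter value that minimized loss in the grid search at that scale; the paper's actual fitting procedure may differ in detail.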
Problem

Research questions and friction points this paper is trying to address.

Identifies universal scaling laws for hyperparameters in LLM pretraining.
Develops a plug-and-play tool for optimal hyperparameter estimation.
Unifies hyperparameter optimization across diverse model structures and data distributions.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Universal scaling laws for hyperparameter optimization
Plug-and-play tool for optimal hyperparameter estimation
Extensive computational resources for model training