🤖 AI Summary
This paper addresses the challenge of transferring hyperparameters across model scales (width, depth, batch size, and training duration), a setting where existing methods fail to generalize reliably. We propose Complete<sup>(d)</sup>, a modular hyperparameter parameterization framework enabling robust cross-scale transfer under *multi-dimensional* coordinated scaling, overcoming the limitation of prior μP approaches that support only single-axis scaling. Our method combines per-module hyperparameter optimization with a joint search over residual block multipliers and initialization scales in the AdamW hyperparameter space. Evaluated on large language models, Complete<sup>(d)</sup> demonstrates full-stack hyperparameter transferability, covering the learning rate, AdamW parameters (β₁, β₂, ε), weight decay, initialization scale, and residual scaling, across diverse model sizes. This yields substantially improved training stability and faster convergence, establishing a systematic, scalable paradigm for hyperparameter transfer in large-model training.
📝 Abstract
Hyperparameter tuning can dramatically impact the training stability and final performance of large-scale models. Recent work on neural network parameterisations, such as $μ$P, has enabled the transfer of optimal global hyperparameters across model sizes. These works propose an empirical practice of searching for optimal global base hyperparameters at a small model size and transferring them to a large one. We extend these works in two key ways. First, to handle scaling along the most important scaling axes, we propose the Complete$^{(d)}$ Parameterisation, which unifies scaling in width and depth (using an adaptation of CompleteP) as well as in batch size and training duration. Second, with our parameterisation, we investigate per-module hyperparameter optimisation and transfer. We characterise the empirical challenges of navigating the high-dimensional hyperparameter landscape and propose practical guidelines for tackling this optimisation problem. We demonstrate that, with the right parameterisation, hyperparameter transfer holds even in the per-module regime. Our study covers an extensive range of optimisation hyperparameters of modern models: learning rates, AdamW parameters, weight decay, initialisation scales, and residual block multipliers. Our experiments demonstrate significant training-speed improvements in Large Language Models with the transferred per-module hyperparameters.
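To make the transfer recipe concrete, here is a minimal sketch of mapping base hyperparameters tuned on a small proxy model to a larger target. The specific scaling rules below are assumptions for illustration only (μP-style 1/width scaling for hidden-layer learning rates and 1/√width for initialisation scale, plus a CompleteP-style 1/depth residual block multiplier); they are not the paper's exact Complete$^{(d)}$ rules, and the function name `transfer_hparams` is hypothetical.

```python
def transfer_hparams(base, base_width, base_depth, width, depth):
    """Map base hyperparameters tuned at (base_width, base_depth) to a
    target (width, depth). All scaling rules here are illustrative
    assumptions, not the paper's exact parameterisation."""
    m_w = width / base_width   # width multiplier
    m_d = depth / base_depth   # depth multiplier
    return {
        # Hidden-layer Adam learning rate scales as 1/width under muP.
        "lr_hidden": base["lr_hidden"] / m_w,
        # Embedding learning rate is width-independent under muP.
        "lr_embed": base["lr_embed"],
        # Init std of hidden weights scales as 1/sqrt(width).
        "init_std_hidden": base["init_std_hidden"] / m_w ** 0.5,
        # Residual branch multiplier scales as 1/depth (CompleteP-style).
        "res_multiplier": base["res_multiplier"] / m_d,
    }

# Base hyperparameters found by search on a small proxy model.
base = {"lr_hidden": 1e-2, "lr_embed": 1e-2,
        "init_std_hidden": 0.02, "res_multiplier": 1.0}
print(transfer_hparams(base, base_width=256, base_depth=4,
                       width=1024, depth=16))
```

In the per-module regime studied in the paper, each module (embedding, attention, MLP, etc.) would carry its own base values in such a dictionary, with the parameterisation determining how each one rescales.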