Complete$^{(d)}$ Hyperparameter Transfer across Modules, Width, Depth, Batch and Duration

📅 2025-12-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the transfer of hyperparameters across model scales (width, depth, batch size, and training duration), a regime in which existing parameterisations such as μP support reliable transfer along only a single axis. The authors propose Complete$^{(d)}$, a modular parameterisation framework that enables robust cross-scale transfer under *multi-dimensional* coordinated scaling. The method combines per-module hyperparameter optimisation, modelling of the AdamW hyperparameter space, and joint search over residual block multipliers and initialisation scales. Evaluated on large language models, Complete$^{(d)}$ demonstrates full-stack hyperparameter transfer, covering learning rates, AdamW parameters (β₁, β₂, ε), weight decay, initialisation scales, and residual scaling, across diverse model sizes. This yields improved training stability and faster convergence, establishing a systematic, scalable recipe for hyperparameter transfer in large-model training.

📝 Abstract
Hyperparameter tuning can dramatically impact the training stability and final performance of large-scale models. Recent works on neural network parameterisations, such as $μ$P, have enabled the transfer of optimal global hyperparameters across model sizes. These works propose an empirical practice of searching for optimal global base hyperparameters at a small model size and transferring them to a large one. We extend these works in two key ways. First, to handle scaling along the most important axes, we propose the Complete$^{(d)}$ Parameterisation, which unifies scaling in width and depth -- using an adaptation of CompleteP -- as well as in batch size and training duration. Secondly, with our parameterisation, we investigate per-module hyperparameter optimisation and transfer. We characterise the empirical challenges of navigating the high-dimensional hyperparameter landscape and propose practical guidelines for tackling this optimisation problem. We demonstrate that, with the right parameterisation, hyperparameter transfer holds even in the per-module regime. Our study covers an extensive range of optimisation hyperparameters of modern models: learning rates, AdamW parameters, weight decay, initialisation scales, and residual block multipliers. Our experiments demonstrate significant training speed improvements in Large Language Models with the transferred per-module hyperparameters.
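The transfer recipe the abstract describes (tune base hyperparameters on a small proxy model, then rescale them for the target) can be sketched as follows. This is a minimal illustration, assuming the standard μP rule for Adam hidden-layer learning rates (lr ∝ 1/width) and an illustrative 1/depth residual-branch multiplier; the exact per-module scaling exponents used by Complete$^{(d)}$ are defined in the paper, and `transfer_hparams` is a hypothetical helper, not the authors' code:

```python
def transfer_hparams(base, base_width, base_depth, width, depth):
    """Rescale base hyperparameters tuned on a small proxy model.

    Assumes muP-style Adam scaling for hidden-layer learning rates
    (lr ~ 1/width) and an illustrative 1/depth residual-branch
    multiplier; these exponents are placeholders, not the paper's
    exact parameterisation.
    """
    width_ratio = base_width / width   # < 1 when scaling up
    depth_ratio = base_depth / depth
    return {
        # Hidden-layer LR shrinks with width under muP (Adam).
        "lr_hidden": base["lr"] * width_ratio,
        # Input/embedding LR is width-independent under muP.
        "lr_embed": base["lr"],
        # Residual block multiplier scales with depth (illustrative).
        "residual_mult": base["residual_mult"] * depth_ratio,
        # AdamW betas/eps and weight decay transfer unchanged here.
        "betas": base["betas"],
        "eps": base["eps"],
        "weight_decay": base["weight_decay"],
    }

base = {"lr": 1e-2, "residual_mult": 1.0,
        "betas": (0.9, 0.95), "eps": 1e-8, "weight_decay": 0.1}
target = transfer_hparams(base, base_width=256, base_depth=4,
                          width=4096, depth=32)
print(target["lr_hidden"])      # 0.000625
print(target["residual_mult"])  # 0.125
```

The point of the sketch is the workflow, not the constants: the small-proxy search is cheap, and the parameterisation's scaling rules turn its result into usable settings at the large scale.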
Problem

Research questions and friction points this paper is trying to address.

How can hyperparameter transfer be extended beyond width to depth, batch-size, and training-duration scaling
Can per-module (rather than global) hyperparameters be optimised and transferred across model scales
How to navigate the high-dimensional hyperparameter landscape that per-module optimisation creates
Innovation

Methods, ideas, or system contributions that make the work stand out.

Complete$^{(d)}$ parameterisation unifying scaling across width, depth (via an adaptation of CompleteP), batch size, and training duration
Per-module optimisation and transfer of learning rates, AdamW parameters, weight decay, initialisation scales, and residual block multipliers
Significant training speed improvements in Large Language Models with transferred per-module hyperparameters
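The per-module idea above amounts to giving each module group its own AdamW settings instead of one global set. A minimal sketch of such a configuration, where the module names, values, and the `param_groups` helper are hypothetical stand-ins for illustration (not the paper's tuned values):

```python
# Hypothetical per-module AdamW settings; names and values are
# illustrative, not the paper's tuned configuration.
PER_MODULE_HPARAMS = {
    "embedding":   {"lr": 1.0e-2, "weight_decay": 0.0,
                    "betas": (0.9, 0.95), "eps": 1e-8},
    "attention":   {"lr": 2.5e-3, "weight_decay": 0.1,
                    "betas": (0.9, 0.98), "eps": 1e-8},
    "mlp":         {"lr": 2.0e-3, "weight_decay": 0.1,
                    "betas": (0.9, 0.95), "eps": 1e-8},
    "unembedding": {"lr": 5.0e-4, "weight_decay": 0.0,
                    "betas": (0.9, 0.95), "eps": 1e-8},
}

def param_groups(named_params):
    """Bucket (name, param) pairs into optimizer param groups,
    matching on a module keyword in the parameter name."""
    groups = {k: {"params": [], **hp}
              for k, hp in PER_MODULE_HPARAMS.items()}
    for name, p in named_params:
        for key, group in groups.items():
            if key in name:
                group["params"].append(p)
                break
    return list(groups.values())

# Toy example with string stand-ins for parameter tensors:
groups = param_groups([("embedding.weight", "E"),
                       ("block0.attention.qkv", "W"),
                       ("block0.mlp.fc1", "M")])
print(len(groups))  # 4
```

The resulting list has the same shape as the per-parameter option groups accepted by common optimizer APIs (e.g. PyTorch's `torch.optim.AdamW`), which is what makes per-module settings cheap to plug into an existing training loop.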