🤖 AI Summary
Existing learned optimizers (LOs) exhibit limited meta-generalization, particularly to unseen tasks that require wider, deeper, or longer training trajectories. This work is the first to systematically bring μ-parameterization (μP) theory to two mainstream LO architectures, proposing a lightweight μP-adapted meta-training recipe. Methodologically, the authors derive scale-invariance conditions for LOs and design a low-overhead meta-training procedure (<250 GPU-hours). Experiments show that on large-width models μLO matches or surpasses VeLO, the most performant publicly available learned optimizer, despite VeLO having consumed 4,000 TPU-months of meta-training compute; μLO also generalizes to networks 5× deeper and training horizons 25× longer than those seen during meta-training. The work establishes a rigorous theoretical foundation and an efficient practical path toward scalable, highly generalizable learned optimizers.
📝 Abstract
Learned optimizers (LOs) can significantly reduce the wall-clock training time of neural networks, substantially reducing training costs. However, they can struggle to optimize unseen tasks (meta-generalize), especially when training networks much larger than those seen during meta-training. To address this, we derive the Maximal Update Parametrization ($\mu$P) for two popular learned optimizer architectures and propose a simple meta-training recipe for $\mu$-parameterized LOs ($\mu$LOs). Our empirical evaluation demonstrates that LOs meta-trained with our recipe substantially improve meta-generalization to wider unseen tasks when compared to LOs trained under standard parametrization (e.g., as in existing work). When applying our $\mu$LOs, each trained for less than 250 GPU-hours, to large-width models, we are often able to match or exceed the performance of pre-trained VeLO, the most performant publicly available learned optimizer, which was meta-trained with 4000 TPU-months of compute. We also observe that LOs trained with our $\mu$LO recipe exhibit substantially improved meta-generalization to deeper networks ($5\times$ meta-training) and remarkable generalization to much longer training horizons ($25\times$ meta-training).
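To make the core idea concrete, here is a minimal sketch of one way a μP-style scaling rule can be layered on top of a black-box learned optimizer, following the Adam-style μP prescription of shrinking updates to matrix-like (hidden and output) weights by their fan-in. The function name `mup_scaled_update`, the layer-kind labels, and the exact per-layer rules are illustrative assumptions for exposition, not the paper's derivation or code.

```python
# Minimal sketch (assumptions, not the authors' implementation): wrapping a
# black-box learned optimizer's raw per-tensor update with a muP-style
# rescaling so that feature updates stay O(1) as network width grows.
import jax.numpy as jnp

def mup_scaled_update(raw_update, layer_kind, fan_in):
    """Rescale a learned optimizer's proposed update with Adam-style muP rules.

    raw_update : update tensor proposed by the learned optimizer
    layer_kind : 'input', 'hidden', or 'output' (hypothetical labels)
    fan_in     : input dimension of the corresponding weight matrix
    """
    if layer_kind in ("hidden", "output"):
        # Matrix-like parameters: divide the update by fan-in so its effect on
        # activations is width-independent (the scale invariance muP targets).
        return raw_update / fan_in
    # Input weights and biases keep their natural, width-independent scale.
    return raw_update

# Usage: a 1024-wide hidden layer receives a raw update from the learned
# optimizer; the wrapper shrinks it by 1/1024 before it is applied.
w_hidden = jnp.zeros((1024, 1024))
raw_update = 1e-2 * jnp.ones_like(w_hidden)   # placeholder for the LO's output
w_hidden = w_hidden + mup_scaled_update(raw_update, "hidden", fan_in=1024)
```

Because the rescaling depends only on each tensor's shape and role, the same meta-trained optimizer weights can be reused unchanged when the target network is made wider, which is what the recipe relies on for width generalization.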