μLO: Compute-Efficient Meta-Generalization of Learned Optimizers

📅 2024-05-31
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
📄 PDF
🤖 AI Summary
Existing learned optimizers (LOs) exhibit limited meta-generalization, particularly to unseen tasks requiring wider, deeper, or longer training trajectories. This work systematically applies Maximal Update Parametrization (μP) theory to two mainstream LO architectures for the first time, deriving the μP for each and pairing it with a lightweight meta-training recipe. Methodologically, the authors derive conditions under which LO dynamics remain invariant to model width and design a low-overhead meta-training procedure (<250 GPU-hours). Experiments demonstrate that μLOs match or surpass VeLO's performance on large-width models, despite VeLO consuming 4,000 TPU-months of meta-training compute, and that they generalize to networks 5× deeper and training horizons 25× longer than those seen during meta-training. This work establishes a rigorous theoretical foundation and an efficient training pathway for scalable, highly generalizable learned optimizers.

📝 Abstract
Learned optimizers (LOs) can significantly reduce the wall-clock training time of neural networks, substantially reducing training costs. However, they can struggle to optimize unseen tasks (meta-generalize), especially when training networks much larger than those seen during meta-training. To address this, we derive the Maximal Update Parametrization (μP) for two popular learned optimizer architectures and propose a simple meta-training recipe for μ-parameterized LOs (μLOs). Our empirical evaluation demonstrates that LOs meta-trained with our recipe substantially improve meta-generalization to wider unseen tasks when compared to LOs trained under standard parametrization (e.g., as they are trained in existing work). When applying our μLOs, each trained for less than 250 GPU-hours, to large-width models, we are often able to match or exceed the performance of pre-trained VeLO, the most performant publicly available learned optimizer, which was meta-trained with 4,000 TPU-months of compute. We also observe that learned optimizers trained with our μLO recipe exhibit substantially improved meta-generalization to deeper networks (5× meta-training depth) and remarkable generalization to much longer training horizons (25× meta-training length).
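The core mechanism described above is to rescale a learned optimizer's per-parameter updates according to μP so that training dynamics stay approximately width-invariant. The sketch below illustrates the idea in plain NumPy; the function names, the base width, and the per-layer-type scaling rules are illustrative assumptions (based on the standard μP prescription for Adam-like updates), not the paper's exact implementation.

```python
import numpy as np

BASE_WIDTH = 256  # hypothetical meta-training ("base") width


def mup_update_scale(kind: str, fan_in: int) -> float:
    """Per-layer multiplier applied to a learned optimizer's raw update.

    Following the usual muP prescription for Adam-like optimizers:
    hidden and output weight matrices have their updates shrunk as
    1/fan_in relative to the base width, while input weights, biases,
    and norm parameters are left unscaled. This keeps feature-learning
    dynamics approximately constant as the model is widened.
    """
    if kind in ("hidden", "output"):
        return BASE_WIDTH / fan_in
    return 1.0  # "input", "bias", "norm", ...


def apply_scaled_updates(params, raw_updates, kinds, fan_ins):
    """Apply muP-scaled updates from a (hypothetical) learned optimizer.

    `raw_updates` stands in for whatever the LO network emits per
    parameter; only the muP rescaling is shown here.
    """
    return {
        name: params[name]
        - mup_update_scale(kinds[name], fan_ins[name]) * raw_updates[name]
        for name in params
    }
```

For example, a hidden matrix at width 512 would have its update halved relative to the base width of 256, while a bias vector is updated at full magnitude regardless of width.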
Problem

Research questions and friction points this paper is trying to address.

Improve meta-generalization of learned optimizers for unseen tasks
Address optimization of networks wider than those seen during meta-training
Enhance generalization to deeper networks and longer training horizons
Innovation

Methods, ideas, or system contributions that make the work stand out.

Maximal Update Parametrization for LOs
Simple meta-training recipe for μLOs
Improved meta-generalization to wider networks