Celo2: Towards Learned Optimization Free Lunch

📅 2026-02-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing learned optimizers suffer from poor generalization and prohibitively high meta-training costs, hindering practical deployment. This work proposes a streamlined normalized optimizer architecture coupled with an enhanced meta-training strategy that drastically reduces computational overhead—requiring only 4.5 GPU hours—while remaining compatible with modern optimization techniques such as orthogonalization, layer-wise updates, and decoupled weight decay. The resulting learned optimizer scales robustly to billion-parameter models, outperforming prior methods on GPT-3 XL (1.3B) and demonstrating strong out-of-distribution generalization across diverse tasks, thereby overcoming limitations imposed by model scale and distributional shifts.

📝 Abstract
Learned optimizers are powerful alternatives to hand-designed update rules like Adam, yet they have seen limited practical adoption: they often fail to meta-generalize beyond their training distribution and incur high meta-training costs. For instance, prior work, VeLO, scaled meta-training to 4,000 TPU-months ($\sim$10$\times$ GPT-3 compute) to meta-train a general-purpose optimizer, yet it failed to generalize beyond 600M-parameter tasks. In this work, we present a surprising finding: by crafting a simple normalized optimizer architecture and augmenting meta-training, it becomes feasible to meta-train a performant general-purpose learned update rule on a tiny fraction of VeLO's compute, 4.5 GPU hours to be precise. Our learned update rule scales stably to a billion-scale pretraining task (GPT-3 XL, 1.3B parameters), six orders of magnitude larger than its meta-training distribution. Furthermore, it shows strong performance across diverse out-of-distribution tasks and is compatible with a modern optimization harness that includes orthogonalization, distinct update rules for input-output and hidden weights, and decoupled weight decay. In all, this work paves the way for practically applicable learnable optimization algorithms, unlocking exploration of richer meta-training and data-curation recipes to further improve performance.
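To make two of the abstract's ingredients concrete, here is a minimal sketch of how a normalized update and decoupled weight decay compose. This is an illustrative reconstruction, not the paper's actual architecture: the function names (`normalized_update`, `apply_update`) and the RMS normalization choice are assumptions; the key point is only that the decay term is applied to the parameters separately from whatever (hand-designed or learned) rule produced the update, as in AdamW.

```python
import numpy as np

def normalized_update(grad, eps=1e-8):
    """Hypothetical normalized update: rescale the raw gradient to
    unit RMS, loosely in the spirit of a 'normalized optimizer'.
    The real learned rule would be a small neural network; RMS
    scaling here is only a stand-in."""
    rms = np.sqrt(np.mean(grad ** 2)) + eps
    return grad / rms

def apply_update(param, update, lr=0.01, weight_decay=0.1):
    """One step with decoupled weight decay (AdamW-style): the decay
    shrinks the weights directly, independent of the update rule,
    rather than being folded into the gradient."""
    param = param * (1.0 - lr * weight_decay)  # decoupled decay
    return param - lr * update                 # optimizer's update

w = np.array([1.0, -2.0, 3.0])
g = np.array([0.5, -0.5, 1.0])
w_new = apply_update(w, normalized_update(g))
```

Because the decay is decoupled, swapping `normalized_update` for a learned network leaves the regularization behavior unchanged, which is one reason such harness components compose cleanly with a learned rule.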
Problem

Research questions and friction points this paper is trying to address.

learned optimizers
meta-generalization
meta-training cost
optimization algorithms
Innovation

Methods, ideas, or system contributions that make the work stand out.

learned optimizer
meta-training
generalization
scalable optimization
efficient training