Cautious Optimizers: Improving Training with One Line of Code

📅 2024-11-25
🏛️ arXiv.org
📈 Citations: 4
Influential: 1
🤖 AI Summary
To address the slow convergence and instability of momentum-based optimizers (e.g., AdamW) in Transformer pretraining, this paper proposes the "Cautious Optimizer": a single-line PyTorch modification that masks out update coordinates whose sign disagrees with the current gradient, turning any momentum-based optimizer into a cautious variant (e.g., C-AdamW, C-Lion). Theoretically, the authors show that this modification preserves Adam's Hamiltonian function and does not break the convergence guarantee under Lyapunov analysis, and that the same insight reveals a whole new family of optimizers. Empirically, the simplest variant speeds up Llama and MAE pretraining by up to 1.47× and also improves results on LLM post-training tasks. The implementation is open source.

📝 Abstract
AdamW has been the default optimizer for transformer pretraining. For many years, our community searched for faster and more stable optimizers with only constrained positive outcomes. In this work, we propose a single-line modification in PyTorch to any momentum-based optimizer, which we rename cautious optimizer, e.g. C-AdamW and C-Lion. Our theoretical result shows that this modification preserves Adam's Hamiltonian function and it does not break the convergence guarantee under the Lyapunov analysis. In addition, a whole new family of optimizers is revealed by our theoretical insight. Among them, we pick the simplest one for empirical experiments, showing not only speed-up on Llama and MAE pretraining up to $1.47$ times, but also better results in LLM post-training tasks. Code is available at https://github.com/kyleliang919/C-Optim.
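The "single-line modification" is a sign-agreement mask applied to the optimizer's proposed update just before the parameter step. A minimal pure-Python sketch of the rule (the function name, `eps`, and the magnitude-preserving rescale are illustrative choices, not the authors' exact code):

```python
def cautious_mask(update, grad, eps=1e-8):
    """Zero out update coordinates whose sign disagrees with the gradient.

    `update` is the optimizer's proposed step direction (e.g. Adam's
    m_hat / (sqrt(v_hat) + eps)); `grad` is the current raw gradient.
    Coordinates where update and gradient point the same way are kept,
    the rest are zeroed, and the survivors are rescaled so the mean
    update magnitude is roughly preserved.
    """
    mask = [1.0 if u * g > 0 else 0.0 for u, g in zip(update, grad)]
    scale = len(mask) / (sum(mask) + eps)
    return [u * m * scale for u, m in zip(update, mask)]
```

In tensor code this collapses to roughly one line inside the optimizer's step, along the lines of `u.mul_((u * g > 0).to(u.dtype))`; see the linked repository for the authors' actual implementation.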
Problem

Research questions and friction points this paper is trying to address.

Optimizer Improvement
Transformer Model Training
Momentum-based Optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pytorch
Cautious Optimizer
Performance Enhancement