OLion: Approaching the Hadamard Ideal by Intersecting Spectral and $\ell_{\infty}$ Implicit Biases

πŸ“… 2026-02-01
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses performance limitations in large-scale model training caused by implicit biases, redundant optimizer states, and optimizer mismatch during fine-tuning. The authors propose OLion, the first optimizer to integrate orthogonalized momentum with element-wise sign-based updates. By leveraging Newton–Schulz iteration to approximate orthogonalization of momentum directions and jointly enforcing spectral norm and β„“βˆž constraints under a diagonal isotropy assumption, OLion efficiently approximates the ideal Hadamard structure. Requiring only momentum-level state storage, OLion matches or surpasses the performance of AdamW and Muon across pretraining and supervised fine-tuning on GPT-2, Llama, and SiT architectures, while significantly alleviating the optimizer mismatch issue commonly observed when fine-tuning models pretrained with AdamW.
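The Newton–Schulz iteration mentioned above can be sketched in a few lines. This is a minimal NumPy illustration of the classic cubic variant (X ← 1.5X − 0.5XXᵀX), not the paper's implementation; the step count and the Frobenius-norm preconditioning are illustrative choices, and practical optimizers such as Muon use tuned polynomial coefficients instead.

```python
import numpy as np

def newton_schulz_orthogonalize(m, steps=10):
    """Approximate the orthogonal (polar) factor of a matrix M.

    Cubic Newton-Schulz iteration X <- 1.5 X - 0.5 (X X^T) X,
    which drives every singular value toward 1, so the iterate
    converges to U V^T for M = U S V^T. Dividing by the Frobenius
    norm first puts the spectral norm below sqrt(3), the standard
    convergence precondition for this iteration.
    """
    x = m / (np.linalg.norm(m) + 1e-12)  # Frobenius normalization
    for _ in range(steps):
        x = 1.5 * x - 0.5 * (x @ x.T) @ x
    return x
```

Each iteration only needs matrix multiplies, which is why a few steps of this scheme are a cheap stand-in for an exact SVD-based orthogonalization of the momentum matrix.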

πŸ“ Abstract
Many optimizers can be interpreted as steepest-descent methods under norm-induced geometries, and thus inherit corresponding implicit biases. We introduce OLion, which combines spectral control from orthogonalized update directions with $\ell_\infty$-style coordinate control from sign updates. OLion forms a Lion-style momentum direction, approximately orthogonalizes it via a few Newton--Schulz iterations, and then applies an entrywise sign, providing an efficient approximation to taking a maximal step over the intersection of the spectral and $\ell_\infty$ constraint sets (a scaled Hadamard-like set for matrix parameters). Despite the strong nonlinearity of orthogonalization and sign, we prove convergence under a mild, empirically verified diagonal-isotropy assumption. Across large-scale language and vision training, including GPT-2 and Llama pretraining, SiT image pretraining, and supervised fine-tuning, OLion matches or outperforms AdamW and Muon under comparable tuning while using only momentum-level optimizer state, and it mitigates optimizer mismatch when fine-tuning AdamW-pretrained checkpoints.
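The three-stage update described in the abstract can be sketched as a single step on one matrix parameter. This is a hypothetical NumPy illustration assembled from the abstract's description (Lion-style interpolated direction, a few Newton–Schulz iterations, entrywise sign); the hyperparameters, Newton–Schulz variant, and momentum update rule are assumptions, not the paper's values.

```python
import numpy as np

def olion_step(w, grad, momentum, lr=1e-4, beta1=0.9, beta2=0.99, ns_steps=5):
    """One hypothetical OLion-style update on a matrix parameter W.

    Mirrors the abstract: (1) form a Lion-style interpolated update
    direction, (2) approximately orthogonalize it with a few cubic
    Newton-Schulz iterations, (3) apply an entrywise sign. All
    coefficients here are illustrative.
    """
    # (1) Lion-style direction: interpolate momentum and fresh gradient.
    direction = beta1 * momentum + (1.0 - beta1) * grad
    # (2) Approximate the polar factor via Newton-Schulz iterations,
    #     after Frobenius normalization for convergence.
    x = direction / (np.linalg.norm(direction) + 1e-12)
    for _ in range(ns_steps):
        x = 1.5 * x - 0.5 * (x @ x.T) @ x
    # (3) Entrywise sign bounds every coordinate of the update,
    #     giving the ell_infty-style control on top of the spectral one.
    w = w - lr * np.sign(x)
    # Lion-style momentum tracking with the second interpolation factor.
    momentum = beta2 * momentum + (1.0 - beta2) * grad
    return w, momentum
```

Note that, as in Lion, the only persistent state is the momentum matrix itself, which is the "momentum-level optimizer state" the abstract refers to; the sign step also caps each parameter's per-step change at `lr`.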
Problem

Research questions and friction points this paper is trying to address.

implicit bias
spectral norm
$\ell_\infty$ norm
optimizer mismatch
Hadamard ideal
Innovation

Methods, ideas, or system contributions that make the work stand out.

implicit bias
spectral norm
$\ell_\infty$ optimization
orthogonalization
efficient optimizer