ARO: A New Lens On Matrix Optimization For Large Models

📅 2026-02-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing optimization methods based on orthogonalization or whitening face efficiency bottlenecks in large language model training. This work proposes the Adaptively Rotated Optimization (ARO) framework, which introduces gradient rotation as a first-class optimization primitive. ARO performs norm-aware steepest descent in a rotated coordinate system, where the rotation is chosen by an adaptive, norm-sensitive policy. By reinterpreting the optimization process through the lens of rotational symmetry in residual streams, ARO also opens new avenues for cross-layer and cross-module coupled optimization. Under a rigorously controlled benchmarking protocol, ARO converges 1.3–1.35× faster than AdamW and outperforms orthogonalization-based methods by 1.1–1.15×, with no evidence of diminishing returns even at 8B activated parameters and an 8× larger training budget.

📝 Abstract
Matrix-based optimizers have attracted growing interest for improving LLM training efficiency, with significant progress centered on orthogonalization/whitening-based methods. While these yield substantial performance gains, a fundamental question arises: can we develop new paradigms beyond orthogonalization that push the efficiency frontier further? We present \textbf{Adaptively Rotated Optimization (ARO)}, a new matrix optimization framework that treats gradient rotation as a first-class design principle. ARO accelerates LLM training by performing norm-aware steepest descent in a rotated coordinate system, where the rotation is determined by a novel norm-informed policy. This perspective yields update rules that go beyond existing orthogonalization and whitening optimizers, improving sample efficiency in practice. To make comparisons reliable, we propose a rigorously controlled benchmarking protocol that reduces confounding and bias. Under this protocol, ARO consistently outperforms AdamW (by 1.3$\sim$1.35$\times$) and orthogonalization methods (by 1.1$\sim$1.15$\times$) in LLM pretraining at up to 8B activated parameters and up to an $8\times$ overtrain budget, without evidence of diminishing returns. Finally, we discuss how ARO can be reformulated as a symmetry-aware optimizer grounded in rotational symmetries of residual streams, motivating advanced designs that enable computationally efficient exploitation of cross-layer/cross-module couplings.
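The core idea of the abstract — steepest descent carried out in a rotated coordinate system — can be sketched with a toy update rule. The paper's actual norm and rotation policy are not given here, so the sketch below makes two illustrative assumptions: sign descent stands in for norm-aware steepest descent (it is the steepest direction under the infinity norm), and the orthogonal rotation `Q` is taken from a QR factorization of the gradient as a hypothetical "norm-informed" choice.

```python
import numpy as np

def rotated_steepest_descent_step(W, G, Q, lr=0.01):
    """One illustrative ARO-style update (assumption, not the paper's rule):
    rotate the gradient, take the steepest-descent direction under the
    infinity norm (sign descent) in the rotated basis, rotate back."""
    G_rot = Q.T @ G            # express the gradient in the rotated coordinates
    step_rot = np.sign(G_rot)  # steepest direction for the inf-norm
    step = Q @ step_rot        # map the update back to the original basis
    return W - lr * step

# Hypothetical rotation policy for illustration only: orthogonal factor
# from a QR decomposition of the current gradient.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))
G = rng.standard_normal((4, 4))
Q, _ = np.linalg.qr(G)
W_new = rotated_steepest_descent_step(W, G, Q)
```

Setting `Q` to the identity recovers plain sign descent, so the rotation is the only ingredient this sketch adds on top of a standard norm-constrained update.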
Problem

Research questions and friction points this paper is trying to address.

matrix optimization
large language models
training efficiency
orthogonalization
gradient rotation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptively Rotated Optimization
matrix optimization
gradient rotation
norm-informed policy
rotational symmetry