Low-rank Orthogonalization for Large-scale Matrix Optimization with Applications to Foundation Model Training

📅 2025-09-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Neural network training is fundamentally a large-scale matrix optimization problem, yet existing methods often neglect the intrinsic matrix structure of model parameters. This paper proposes a low-rank orthogonalization framework—the first to integrate low-rank matrix decomposition and orthogonalization directly into gradient updates—combining matrix-signed gradient descent with a low-rank variant of the Muon optimizer. By exploiting the empirically observed low-rank structure of gradients, the method achieves significant computational efficiency gains while preserving theoretical rigor, and it outperforms a carefully tuned baseline Muon in GPT-2 and LLaMA pretraining. The authors provide a rigorous convergence analysis, establishing iteration complexity bounds and convergence under heavy-tailed gradient noise. This work establishes a matrix-aware optimization paradigm for large foundation models that balances efficiency, theoretical soundness, and interpretability.

📝 Abstract
Neural network (NN) training is inherently a large-scale matrix optimization problem, yet the matrix structure of NN parameters has long been overlooked. Recently, the optimizer Muon [jordanmuon], which explicitly exploits this structure, has gained significant attention for its strong performance in foundation model training. A key component contributing to Muon's success is matrix orthogonalization. In this paper, we propose low-rank orthogonalization, which explicitly leverages the low-rank nature of gradients during NN training. Building on this, we propose low-rank matrix-signed gradient descent and a low-rank variant of Muon. Our numerical experiments demonstrate the superior performance of low-rank orthogonalization, with the low-rank Muon achieving promising results in GPT-2 and LLaMA pretraining, surpassing the performance of the carefully tuned vanilla Muon. Theoretically, we establish the iteration complexity of low-rank matrix-signed gradient descent for finding an approximate stationary solution, as well as that of low-rank Muon for finding an approximate stochastic stationary solution under heavy-tailed noise.
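The abstract does not spell out the update rule, but Muon-style orthogonalization is commonly understood as replacing a gradient matrix with its polar factor (all singular values set to one). A minimal sketch of the low-rank idea, then, is to orthogonalize only a rank-r truncation of the gradient; the function name and the SVD-based construction below are illustrative assumptions, not the paper's actual algorithm (which the authors derive with specific iteration-complexity guarantees):

```python
import numpy as np

def low_rank_orthogonalize(G, r):
    """Illustrative sketch: orthogonalize the best rank-r approximation of G.

    Takes the top-r singular vectors of G and discards the singular values,
    returning U_r @ V_r^T, whose nonzero singular values are all 1. This is
    a reconstruction from the abstract, not the paper's exact method.
    """
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    return U[:, :r] @ Vt[:r, :]

rng = np.random.default_rng(0)
# Gradients in NN training are empirically near low-rank; emulate that
# with an 8x6 matrix of rank at most 4.
G = rng.standard_normal((8, 4)) @ rng.standard_normal((4, 6))
O = low_rank_orthogonalize(G, r=4)
# The result has exactly r unit singular values and the rest zero.
print(np.round(np.linalg.svd(O, compute_uv=False), 6))
```

Because the SVD is computed only up to rank r in a practical implementation (e.g. via randomized or subspace methods), the cost can be far below a full orthogonalization when r is much smaller than the matrix dimensions, which is the efficiency argument the abstract makes.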
Problem

Research questions and friction points this paper is trying to address.

Optimizing large-scale matrix problems in neural networks
Improving foundation model training via low-rank orthogonalization
Enhancing Muon optimizer with low-rank gradient exploitation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Low-rank orthogonalization for matrix optimization
Low-rank matrix-signed gradient descent method
Low-rank variant of Muon optimizer
Chuan He
Department of Mathematics, Linköping University, Sweden
Zhanwang Deng
Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, People's Republic of China
Zhaosong Lu
University of Minnesota
continuous optimization · machine learning · computational statistics