🤖 AI Summary
Training large language models (LLMs) suffers from instability and poor convergence. To address this, we propose POET, a novel reparameterization method and the first to introduce orthogonal equivalence transformations into LLM optimization. POET represents each weight matrix as the product of two learnable orthogonal matrices and a fixed random matrix, thereby exactly preserving the singular spectrum of the original weights. This design simultaneously improves optimization stability and generalization. To make the method scalable, we further develop an efficient approximation framework that combines orthogonal optimization, frozen random weights, and low-rank gradient reconstruction. Extensive experiments show that POET significantly accelerates convergence, improves final model performance across diverse LLM training tasks, and markedly increases training robustness. Notably, POET enables efficient and stable training of billion-parameter-scale models.
📝 Abstract
While large language models (LLMs) are driving the rapid advancement of artificial intelligence, effectively and reliably training these large models remains one of the field's most significant challenges. To address this challenge, we propose POET, a novel reParameterized training algorithm that uses Orthogonal Equivalence Transformation to optimize neurons. Specifically, POET reparameterizes each neuron with two learnable orthogonal matrices and a fixed random weight matrix. Because it provably preserves the spectral properties of the weight matrices, POET can stably optimize the objective function with improved generalization. We further develop efficient approximations that make POET flexible and scalable for training large-scale neural networks. Extensive experiments validate the effectiveness and scalability of POET in training LLMs.
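The core reparameterization described above can be sketched numerically: a fixed random weight matrix is multiplied on both sides by orthogonal factors, which leaves its singular values unchanged. A minimal NumPy illustration of this invariance (variable names here are illustrative, not taken from the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_orthogonal(n):
    # QR decomposition of a Gaussian matrix gives a random orthogonal matrix;
    # sign correction via diag(r) makes the distribution uniform (Haar).
    q, r = np.linalg.qr(rng.normal(size=(n, n)))
    return q * np.sign(np.diag(r))

m, n = 6, 4
W0 = rng.normal(size=(m, n))   # fixed random weight matrix (frozen)
R = random_orthogonal(m)       # learnable left orthogonal factor
P = random_orthogonal(n)       # learnable right orthogonal factor

# Orthogonal equivalence transformation of W0
W = R @ W0 @ P.T

# Singular values are invariant under orthogonal transformations on both sides
s0 = np.linalg.svd(W0, compute_uv=False)
s = np.linalg.svd(W, compute_uv=False)
assert np.allclose(s0, s)
```

During training, only the orthogonal factors would be updated (subject to an orthogonality constraint), so the spectrum fixed at initialization is preserved throughout optimization.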