TEON: Tensorized Orthonormalization Beyond Layer-Wise Muon for Large Language Model Pre-Training

📅 2026-01-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing Muon-style optimizers orthogonalize gradients only at the matrix level within individual layers, failing to capture cross-layer structural information and thereby limiting the convergence efficiency and stability of large language model pre-training. This work proposes TEON, which, for the first time, treats a network's gradients as a higher-order tensor and performs orthogonalization across the global, cross-layer structure, yielding a stronger theoretical convergence guarantee than layer-wise Muon. By combining structured orthogonalization with an efficient approximate SVD algorithm, TEON consistently reduces both training and validation perplexity on GPT-style models (130M–774M parameters) and LLaMA-style models (60M–1B parameters), and remains robust across various approximation strategies.

📝 Abstract
The Muon optimizer has demonstrated strong empirical performance in pre-training large language models by performing matrix-level gradient (or momentum) orthogonalization in each layer independently. In this work, we propose TEON, a principled generalization of Muon that extends orthogonalization beyond individual layers by modeling the gradients of a neural network as a structured higher-order tensor. We present TEON's improved convergence guarantee over layer-wise Muon, and further develop a practical instantiation of TEON based on the theoretical analysis with corresponding ablation. We evaluate our approach on two widely adopted architectures: GPT-style models, ranging from 130M to 774M parameters, and LLaMA-style models, ranging from 60M to 1B parameters. Experimental results show that TEON consistently improves training and validation perplexity across model scales and exhibits strong robustness under various approximate SVD schemes.
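To make the distinction concrete, here is a minimal NumPy sketch of the two orthogonalization regimes the abstract contrasts. It uses an exact SVD to compute the polar factor (Muon itself approximates this with a Newton–Schulz iteration, and TEON's actual tensor structure and approximate SVD scheme are not specified here); the cross-layer variant, which stacks per-layer gradients into a third-order tensor and orthogonalizes one matricized unfolding jointly, is purely illustrative of the idea, not the paper's algorithm.

```python
import numpy as np

def orthogonalize(G):
    """Map G to its nearest (semi-)orthogonal matrix via the polar factor:
    G = U S V^T  ->  U V^T. Muon approximates this step with Newton-Schulz."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)
# Toy per-layer gradient matrices (4 layers, each 64x64).
grads = [rng.standard_normal((64, 64)) for _ in range(4)]

# Layer-wise (Muon-style): each layer's gradient is orthogonalized independently.
layerwise = [orthogonalize(G) for G in grads]

# Cross-layer (illustrative TEON-flavored variant): stack the gradients into a
# 3rd-order tensor and orthogonalize a single matricized unfolding jointly,
# so the update couples information across layers.
T = np.stack(grads)                   # shape (L, m, n)
unfolded = T.reshape(T.shape[0], -1)  # mode-1 unfolding, shape (L, m*n)
joint = orthogonalize(unfolded).reshape(T.shape)
```

Each layer-wise result is an orthogonal matrix, while the joint result has orthonormal rows in its unfolded form; the cost of the joint SVD on large unfoldings is what motivates the approximate SVD schemes the abstract mentions.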
Problem

Research questions and friction points this paper is trying to address.

orthogonalization
large language model
pre-training
gradient optimization
tensorized representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

tensorized orthonormalization
gradient orthogonalization
large language model pre-training
higher-order tensor
convergence guarantee