The Polar Express: Optimal Matrix Sign Methods and Their Application to the Muon Algorithm

📅 2025-05-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Polar decomposition and matrix sign function computation in deep learning—particularly within the Muon optimization framework—suffer from low computational efficiency, poor GPU compatibility, and unnecessary high-precision overhead. Method: We propose Polar Express, an algorithm that relies exclusively on matrix multiplication, integrating iterative polynomial updates with stepwise minimax optimization. It provides theoretical guarantees of optimal worst-case convergence rate. The method supports stable training in bfloat16 precision and features a native GPU implementation seamlessly integrated into the Muon framework. Contribution/Results: Experiments on large language models—including GPT-2—demonstrate significantly reduced validation loss across the full learning rate spectrum, outperforming state-of-the-art alternatives. Polar Express achieves substantial training speedup while reducing memory footprint, establishing a new efficiency frontier for polar-based optimization in deep learning.
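The pipeline the summary describes — smooth the gradient with momentum, orthogonalize the result using only matrix multiplications, then apply it — can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: `muon_style_update` is a hypothetical helper, and it uses the classical cubic Newton-Schulz polynomial in place of Polar Express's per-iteration minimax-optimized coefficients.

```python
import numpy as np

def muon_style_update(W, grad, momentum, lr=0.02, beta=0.95, ns_steps=30):
    """Hypothetical sketch of a Muon-style step: smooth the gradient with
    momentum, orthogonalize the result using only matrix products, and
    apply it. The cubic Newton-Schulz polynomial stands in for the
    paper's optimized per-iteration coefficients."""
    momentum = beta * momentum + grad
    # Scale so all singular values lie in (0, 1], then iterate
    # X <- 1.5*X - 0.5*(X X^T) X toward the polar factor.
    X = momentum / (np.linalg.norm(momentum) + 1e-12)
    for _ in range(ns_steps):
        X = 1.5 * X - 0.5 * (X @ X.T) @ X
    return W - lr * X, momentum
```

Because the applied direction approximates the polar factor of the momentum buffer, it is (nearly) orthogonal — the property a Muon-style update exploits.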

📝 Abstract
Computing the polar decomposition and the related matrix sign function has been a well-studied problem in numerical analysis for decades. More recently, it has emerged as an important subroutine in deep learning, particularly within the Muon optimization framework. However, the requirements in this setting differ significantly from those of traditional numerical analysis. In deep learning, methods must be highly efficient and GPU-compatible, but high accuracy is often unnecessary. As a result, classical algorithms like Newton-Schulz (which suffers from slow initial convergence) and methods based on rational functions (which rely on QR decompositions or matrix inverses) are poorly suited to this context. In this work, we introduce Polar Express, a GPU-friendly algorithm for computing the polar decomposition. Like classical polynomial methods such as Newton-Schulz, our approach uses only matrix-matrix multiplications, making it GPU-compatible. Motivated by earlier work of Chen & Chow and Nakatsukasa & Freund, Polar Express adapts the polynomial update rule at each iteration by solving a minimax optimization problem, and we prove that it enjoys a strong worst-case optimality guarantee. This property ensures both rapid early convergence and fast asymptotic convergence. We also address finite-precision issues, making the method stable in bfloat16 in practice. We apply Polar Express within the Muon optimization framework and show consistent improvements in validation loss on large-scale models such as GPT-2, outperforming recent alternatives across a range of learning rates.
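As a concrete baseline, the classical Newton-Schulz iteration mentioned in the abstract fits in a few lines of NumPy. This is a generic sketch of the cubic variant, not the Polar Express update rule, whose coefficients are instead chosen by solving a minimax problem at each iteration.

```python
import numpy as np

def newton_schulz_polar(G, steps=25):
    # Cubic Newton-Schulz iteration for the polar factor of G:
    #   X <- 1.5*X - 0.5*(X X^T) X
    # Only matrix-matrix products are used, which is what makes this
    # family of methods GPU-friendly. Dividing by the Frobenius norm
    # puts every singular value in (0, 1], inside the convergence region.
    X = G / np.linalg.norm(G)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ X.T) @ X
    return X

# A matrix with known polar factor U @ V.T:
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((4, 4)))
V, _ = np.linalg.qr(rng.standard_normal((4, 4)))
G = U @ np.diag([2.0, 1.0, 0.5, 0.1]) @ V.T
print(np.allclose(newton_schulz_polar(G), U @ V.T, atol=1e-5))  # prints True
```

Each iteration costs a few GEMMs; the slow part is the early phase, where small singular values grow only by a factor of about 1.5 per step — precisely the regime the paper's adaptive polynomials are designed to accelerate.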
Problem

Research questions and friction points this paper is trying to address.

Develop GPU-efficient polar decomposition for deep learning
Improve convergence speed and stability in matrix sign methods
Optimize Muon algorithm performance for large-scale models
Innovation

Methods, ideas, or system contributions that make the work stand out.

GPU-friendly polar decomposition algorithm
Adaptive polynomial update via minimax optimization
Stable in bfloat16 with rapid convergence
Noah Amsel
Courant Institute, NYU
David Persson
New York University and Flatiron Institute
Christopher Musco
Associate Professor, New York University
Algorithms · Theory of Computation · Machine Learning
Robert Gower
Flatiron Institute