When and Why Grouping Attention Heads Accelerates Muon Optimization

📅 2026-05-09

🤖 AI Summary
This work addresses a granularity question in applying the Muon optimizer to multi-head attention: whether to orthogonalize the full QKV projection, individual attention heads, or intermediate groups of heads. The authors propose Group Muon, which treats head group size and grouping rule as optimizer hyperparameters. A one-step descent comparison between full-matrix and group-wise Muon characterizes the trade-off between the group-wise whitening gain and the grouping-induced norm cost, the extra update-norm cost of replacing full-matrix whitening with group-wise whitening. On GPT-2 Small trained on FineWeb, appropriately chosen grouping improves validation loss over both full-QKV Muon and fully head-wise MuonSplit.
📝 Abstract
Muon orthogonalizes matrix updates, but multi-head attention naturally operates at the level of heads. This granularity mismatch raises the question of whether Muon should be applied to the full attention projection, to individual heads, or to intermediate head groups. We study this question through a one-step descent comparison between full-matrix Muon and group-wise Muon. Our analysis reveals a trade-off between the group-wise whitening gain from group-wise updates and the grouping-induced norm cost, an additional update-norm cost caused by replacing full-matrix whitening with group-wise whitening. Motivated by this trade-off, we propose Group Muon, which treats head group size and grouping rule as optimizer hyperparameters. On GPT-2 Small trained on FineWeb, appropriate grouping improves validation loss over both full-QKV Muon and fully head-wise MuonSplit.
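To make the granularity question concrete, the sketch below orthogonalizes a gradient either as one matrix, per head, or per group of heads. It is an illustration under stated assumptions, not the authors' implementation: the row-major head layout, the contiguous grouping rule, the function names (`newton_schulz`, `group_orthogonalize`, `toy_gradient`, `max_gram_error`), and the use of the classic cubic Newton-Schulz iteration in place of whatever orthogonalization routine the paper uses are all hypothetical choices. Momentum, learning rate, and update scaling are omitted.

```python
def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def newton_schulz(g, steps=30):
    """Approximate the orthogonal polar factor of a wide (rows <= cols) matrix.

    Uses the classic cubic iteration X <- 1.5 X - 0.5 (X X^T) X, a stand-in
    chosen for clarity. Frobenius normalization keeps singular values in (0, 1].
    """
    fro = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / (fro + 1e-12) for v in row] for row in g]
    for _ in range(steps):
        correction = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(rx, rc)]
             for rx, rc in zip(x, correction)]
    return x


def group_orthogonalize(grad, n_heads, head_dim, group_size):
    """Orthogonalize each block of `group_size` contiguous heads separately.

    Assumes (hypothetically) that `grad` has n_heads * head_dim rows, with
    head h occupying rows h * head_dim through (h + 1) * head_dim - 1.
    """
    if n_heads % group_size != 0:
        raise ValueError("group_size must divide n_heads")
    if len(grad) != n_heads * head_dim:
        raise ValueError("grad must have n_heads * head_dim rows")
    rows_per_group = group_size * head_dim
    update = []
    for start in range(0, len(grad), rows_per_group):
        update.extend(newton_schulz(grad[start:start + rows_per_group]))
    return update


def toy_gradient(rows, cols):
    """Deterministic, well-conditioned toy matrix used only for illustration."""
    return [[(1.0 + i) if i == j else 0.05 * ((i + 2 * j) % 3)
             for j in range(cols)] for i in range(rows)]


def max_gram_error(block):
    """Largest entry-wise deviation of block @ block^T from the identity."""
    gram = matmul(block, transpose(block))
    return max(abs(gram[i][j] - (1.0 if i == j else 0.0))
               for i in range(len(gram)) for j in range(len(gram)))


G = toy_gradient(8, 8)                      # 4 heads, head_dim 2, d_model 8
grouped = group_orthogonalize(G, n_heads=4, head_dim=2, group_size=2)
print(max_gram_error(grouped[0:4]), max_gram_error(grouped[4:8]))
```

In this sketch, `group_size=1` orthogonalizes every head on its own, `group_size=n_heads` recovers a single full-matrix orthogonalization of this projection, and intermediate values give the grouped variants. Each group's rows come out orthonormal within the group, but rows from different groups are generally not orthogonal to each other, so the grouped update differs from the full-matrix one.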
Problem

Research questions and friction points this paper is trying to address.

multi-head attention
Muon optimization
grouping
granularity mismatch
whitening
Innovation

Methods, ideas, or system contributions that make the work stand out.

Group Muon
multi-head attention
whitening gain
norm cost
optimizer hyperparameters