When and Why Grouping Attention Heads Accelerates Muon Optimization

📅 2026-05-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the granularity question that arises when applying the Muon optimizer to multi-head attention: whether to orthogonalize the update for the entire QKV matrix, for individual attention heads, or for groups of heads. The authors propose Group Muon, which makes head grouping an explicit part of optimizer design by treating group size and grouping rule as tunable hyperparameters. Through a one-step descent analysis, they characterize the trade-off between the whitening gain from group-wise updates and the additional update-norm cost that grouping introduces. Experiments on GPT-2 Small trained on the FineWeb dataset show that appropriately configured Group Muon achieves better validation loss than both full-matrix Muon and head-wise MuonSplit.

📝 Abstract

Muon orthogonalizes matrix updates, but multi-head attention naturally operates at the level of heads. This granularity mismatch raises the question of whether Muon should be applied to the full attention projection, to individual heads, or to intermediate head groups. We study this question through a one-step descent comparison between full-matrix Muon and group-wise Muon. Our analysis reveals a trade-off between the group-wise whitening gain from group-wise updates and the grouping-induced norm cost, an additional update-norm cost caused by replacing full-matrix whitening with group-wise whitening. Motivated by this trade-off, we propose Group Muon, which treats head group size and grouping rule as optimizer hyperparameters. On GPT-2 Small trained on FineWeb, appropriate grouping improves validation loss over both full-QKV Muon and fully head-wise MuonSplit.
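
To make the three granularities concrete, here is a minimal NumPy sketch of a group-wise orthogonalized update direction. It is an illustration under stated assumptions, not the authors' implementation: the function names (orthogonalize, group_muon_direction) are hypothetical, heads are assumed to occupy contiguous row blocks of a single projection gradient, groups are formed from adjacent heads (only one possible grouping rule), and an exact SVD stands in for the iterative orthogonalization that Muon implementations typically use. Momentum, learning rate, and any rescaling are omitted.

```python
import numpy as np


def orthogonalize(block):
    """Return the orthogonal polar factor U @ Vt of a matrix.

    Assumption: an exact SVD is used here for clarity; it is a stand-in,
    not the approximate iteration a practical optimizer would run.
    """
    u, _, vt = np.linalg.svd(block, full_matrices=False)
    return u @ vt


def group_muon_direction(grad, num_heads, group_size):
    """Orthogonalize a projection gradient one head group at a time.

    grad has shape (num_heads * head_dim, d_model), with each head
    occupying a contiguous block of rows (an assumed layout).
    group_size == num_heads reproduces full-matrix orthogonalization;
    group_size == 1 reproduces the fully head-wise variant.
    """
    rows, _ = grad.shape
    if rows % num_heads != 0:
        raise ValueError("row count must be divisible by num_heads")
    if num_heads % group_size != 0:
        raise ValueError("num_heads must be divisible by group_size")

    head_dim = rows // num_heads
    rows_per_group = group_size * head_dim

    # Hypothetical grouping rule: adjacent heads share a group.
    blocks = [
        orthogonalize(grad[start:start + rows_per_group])
        for start in range(0, rows, rows_per_group)
    ]
    return np.concatenate(blocks, axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grad = rng.standard_normal((8, 8))  # 4 heads, head_dim = 2, d_model = 8

    full = group_muon_direction(grad, num_heads=4, group_size=4)
    grouped = group_muon_direction(grad, num_heads=4, group_size=2)
    headwise = group_muon_direction(grad, num_heads=4, group_size=1)

    # Spectral norms of the stacked update under each granularity.
    for name, d in [("full", full), ("grouped", grouped), ("headwise", headwise)]:
        print(name, np.linalg.norm(d, ord=2))
```

In this sketch each group block has orthonormal rows on its own, but the stacked matrix is generally not orthogonal once there is more than one group, which is one way to see why grouping changes the norm of the overall update.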

Problem

Research questions and friction points this paper is trying to address.

multi-head attention

Muon optimization

grouping

granularity mismatch

whitening

Innovation

Methods, ideas, or system contributions that make the work stand out.

Group Muon

multi-head attention

whitening gain