Muon in Vision Transformers: Optimizer-Recipe Interactions and Gradient Spectra

📅 2026-05-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates how the Muon optimizer behaves when training Vision Transformers (ViTs), focusing on its interaction with data augmentation and its effect on gradient spectral structure. Using singular-value spectrum analysis of gradient matrices, comparative experiments between Muon and AdamW, and evaluation across several augmentation protocols on datasets including ImageNet-100 and Pl@ntNet-300K, the work finds that Muon relies on strong data augmentation to avoid late-training gradient spectral concentration and mode collapse in deep MLP layers. Under a fixed full augmentation recipe, Muon spreads gradient energy across more singular modes than AdamW in query-key-value (QKV) projections. Muon outperforms AdamW across image classification, segmentation, and masked autoencoding in all settings considered, with especially large gains in macro Top-1 accuracy on the long-tailed Pl@ntNet-300K dataset.
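To make the macro Top-1 metric concrete, here is a minimal sketch of how macro (per-class averaged) accuracy differs from ordinary (micro) accuracy on an imbalanced label set. The function name `top1_accuracies` and the toy labels are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def top1_accuracies(y_true, y_pred, num_classes):
    """Return (micro, macro) top-1 accuracy. Hypothetical helper for illustration."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    correct = y_true == y_pred
    # Micro: fraction of all samples classified correctly.
    micro = float(correct.mean())
    # Macro: accuracy computed per class, then averaged with equal class weight.
    per_class = [correct[y_true == c].mean()
                 for c in range(num_classes) if (y_true == c).any()]
    macro = float(np.mean(per_class))
    return micro, macro

# Toy long-tailed case: a head class with 8 samples and a tail class with 2.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10            # the model always predicts the head class
micro, macro = top1_accuracies(y_true, y_pred, num_classes=2)
```

In this toy case the micro accuracy is 0.8 while the macro accuracy is 0.5, which is why macro Top-1 is the more sensitive measure on long-tailed data.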
📝 Abstract
Muon is a recently developed matrix-aware optimizer that has shown strong results in transformer training, but its behavior in vision transformers (ViTs) is not yet well understood. We study Muon for ViT training, largely on ImageNet-100 and Pl@ntNet-300K, comparing it against AdamW under standard vision recipes involving mixup, cutmix, smoothing, random augmentation, and random erasing. Muon consistently outperforms AdamW, with especially large gains in macro top-1 on the long-tailed Pl@ntNet dataset. These gains are recipe-dependent: Muon benefits much more than AdamW from heavy data augmentation. To understand this interaction, we analyze the singular-value structure of matrix gradients throughout the ViT. Within Muon training runs, removing heavy data augmentation induces late-training spectral concentration and mode collapse in gradient matrices, primarily in deep MLP-down blocks. Under a fixed "full" augmentation recipe, the clearest Muon-AdamW contrast appears instead in QKV gradients, where AdamW gradient energy remains concentrated in a much narrower basis while Muon spreads energy across substantially more singular modes. Muon in ViTs is therefore best understood as an optimizer-recipe interaction: under a fixed recipe, Muon differs from AdamW most clearly in attention projections, where its gradients span a broader spectral basis, and within Muon, a full training recipe is important for preventing late spectral concentration and mode collapse in deep feedforward blocks. We further demonstrate efficacy in training ViTs for image segmentation and as masked autoencoders, where Muon outperforms AdamW in all settings considered.
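As a rough illustration of what "spectral concentration" of a gradient matrix can mean, the sketch below summarizes a matrix's singular-value spectrum with three common statistics. The specific metrics (top-mode energy share, entropy-based effective rank, number of modes covering 90% of the energy) and the name `spectral_summary` are assumptions chosen for illustration; the abstract does not state which statistics the authors use.

```python
import numpy as np

def spectral_summary(grad, energy_frac=0.9):
    """Summarize how gradient energy spreads over singular modes.

    Hypothetical helper; assumes `grad` is a nonzero 2-D array.
    """
    s = np.linalg.svd(grad, compute_uv=False)   # singular values, descending
    p = s**2 / np.sum(s**2)                     # energy share per mode
    nz = p[p > 0]
    # Entropy-based effective rank: exp of the Shannon entropy of p.
    eff_rank = float(np.exp(-np.sum(nz * np.log(nz))))
    # Smallest number of leading modes whose energy reaches energy_frac.
    k = int(np.searchsorted(np.cumsum(p), energy_frac)) + 1
    return {
        "top1_share": float(p[0]),
        "effective_rank": eff_rank,
        "modes_for_energy": min(k, len(s)),
    }

# A rank-1 matrix is maximally concentrated; the identity is maximally spread.
concentrated = spectral_summary(np.outer(np.ones(6), np.arange(1.0, 5.0)))
spread = spectral_summary(np.eye(8))
```

Here `concentrated` reports an effective rank of about 1 with a single mode holding essentially all the energy, while `spread` reports an effective rank of 8. Tracking such statistics per layer over training is one way to observe the kind of late concentration described above.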
Problem

Research questions and friction points this paper is trying to address.

Vision Transformers
optimizer-recipe interaction
gradient spectra
Muon optimizer
training dynamics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Muon optimizer
Vision Transformers
gradient spectra
optimizer-recipe interaction
singular value decomposition