Same Architecture, Different Capacity: Optimizer-Induced Spectral Scaling Laws

📅 2026-05-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a limitation of traditional scaling laws: they overlook the optimizer’s influence on a model’s representational capacity, and so cannot explain performance differences that arise from different training dynamics under an identical architecture. The optimizer is treated as a first-class variable in representation scaling. Using spectral analysis with soft and hard spectral-rank metrics, the study compares how optimizers such as AdamW and Muon convert added FFN width into utilized spectral capacity while the architecture is held constant. The findings show that the optimizer’s effect on spectral scaling exponents often exceeds that of architectural modifications, and that matched loss does not guarantee equivalent representational structure. Notably, on hard-to-learn rare (tail) tokens, Muon achieves near-linear hard-rank scaling (β=1.02) where AdamW reaches only β=0.44, a 2.3× larger scaling exponent.
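As a rough illustration of what a scaling exponent like β means here, the sketch below fits the slope of log(rank) against log(width) by ordinary least squares. It assumes a power-law form rank ≈ c · width^β, which the summary implies but does not state explicitly. The widths, prefactors, and the function name `fit_power_law_exponent` are hypothetical, and the data is synthetic (generated from the reported exponents), not taken from the paper.

```python
import math


def fit_power_law_exponent(widths, ranks):
    """Least-squares slope of log(rank) vs. log(width).

    Assumes rank ~ c * width**beta; the returned slope is the estimate of beta.
    """
    xs = [math.log(w) for w in widths]
    ys = [math.log(r) for r in ranks]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    return cov_xy / var_x


# Hypothetical FFN widths and synthetic, noise-free rank curves built from
# the exponents quoted in the summary (0.44 and 1.02). Prefactors are made up.
widths = [256, 512, 1024, 2048, 4096]
ranks_weak = [3.0 * w ** 0.44 for w in widths]
ranks_linear = [0.5 * w ** 1.02 for w in widths]

beta_weak = fit_power_law_exponent(widths, ranks_weak)
beta_linear = fit_power_law_exponent(widths, ranks_linear)

# Ratio of the two exponents: 1.02 / 0.44 is about 2.3.
exponent_ratio = beta_linear / beta_weak
print(f"beta_weak={beta_weak:.2f}, beta_linear={beta_linear:.2f}, "
      f"ratio={exponent_ratio:.2f}")
```

The 2.3× figure is a ratio of exponents, so it describes how much faster rank grows with width on a log-log plot, not a 2.3× difference in rank at any particular width.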
📝 Abstract
Scaling laws have made language-model performance predictable from model size, data, and compute, but they typically treat the optimizer as a fixed training detail. We show that this assumption misses a fundamental axis of representation scaling: how effectively the optimizer converts added FFN width into utilized spectral capacity. Using eigenspectra of feed-forward network representations, measured through soft and hard spectral ranks, we find that the same Transformer architecture realizes markedly different spectral scaling laws when trained with different optimizers. Holding architecture and width schedule fixed, AdamW exhibits weak hard-rank scaling (β=0.44) on rare-token (TAIL) representations, where learning is known to be hardest, whereas Muon achieves linear scaling (β=1.02) in the same regime, a 2.3× increase in the scaling exponent. This difference is not reducible to validation loss: under extended training, AdamW configurations can match low-rank Dion variants in perplexity while exhibiting sharply different spectral geometry, demonstrating that matched loss does not imply matched representation structure. Hard-soft rank asymmetry further reveals that optimizers differ not only in how much capacity is realized, but also in how that capacity is structured across eigenmodes. To disentangle optimizer effects from architectural ones, we compare against architectural interventions (e.g., attention rank and positional encoding) and find that optimizer-induced spectral shifts often exceed the architectural effects. These results suggest optimization as a first-class axis of representation scaling, motivating optimizer-architecture co-design.
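The abstract does not define its soft and hard spectral ranks, so the following is only a sketch under an assumed pair of definitions that are common in the effective-rank literature: soft rank as the exponential of the Shannon entropy of the normalized eigenvalues, and hard rank as the participation ratio (Σλ)² / Σλ². The paper's exact definitions may differ. The function names and the two toy spectra are hypothetical.

```python
import math


def soft_rank(eigenvalues):
    """ASSUMED definition: exp(Shannon entropy) of the normalized spectrum."""
    total = sum(eigenvalues)
    probs = [ev / total for ev in eigenvalues if ev > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


def hard_rank(eigenvalues):
    """ASSUMED definition: participation ratio (sum ev)^2 / sum(ev^2)."""
    total = sum(eigenvalues)
    return total ** 2 / sum(ev ** 2 for ev in eigenvalues)


# Toy spectra (hypothetical): eight equal eigenmodes vs. one dominant mode.
flat = [1.0] * 8
peaked = [100.0] + [1.0] * 7

for name, spectrum in [("flat", flat), ("peaked", peaked)]:
    print(f"{name}: soft={soft_rank(spectrum):.3f}, "
          f"hard={hard_rank(spectrum):.3f}")
```

Under these assumed definitions, a perfectly flat spectrum gives the same value for both metrics (the number of modes), while a spectrum dominated by a few modes pulls the participation ratio down faster than the entropy-based rank. That gap is one way to read the hard-soft asymmetry the abstract refers to: the two metrics weight dominant and tail eigenmodes differently, so they can disagree about how capacity is distributed.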
Problem

Research questions and friction points this paper is trying to address.

scaling laws
optimizer
spectral capacity
representation structure
Transformer architecture
Innovation

Methods, ideas, or system contributions that make the work stand out.

spectral scaling laws
optimizer-induced representation
hard/soft spectral rank
representation geometry
optimizer-architecture co-design