Same Architecture, Different Capacity: Optimizer-Induced Spectral Scaling Laws

📅 2026-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Traditional scaling laws overlook the optimizer's influence on a model's representational capacity, so they cannot explain performance differences that arise from different training dynamics under an identical architecture. This work treats the optimizer as a first-class variable in representation scaling. Using spectral analysis with soft and hard spectral rank metrics, the study compares how optimizers such as AdamW and Muon convert added width into utilized spectral capacity while holding the architecture constant. The findings show that the optimizer's effect on spectral scaling exponents often exceeds that of architectural modifications, and that matched loss does not guarantee matched representation structure. Notably, on hard-to-learn rare (TAIL) tokens, Muon achieves approximately linear hard-rank scaling (β=1.02) while AdamW scales weakly (β=0.44), a 2.3× increase in the scaling exponent.
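As a rough illustration of what a spectral scaling exponent β means, the sketch below assumes the rank follows a power law in width, rank ≈ c · width^β, and recovers β with a least-squares fit in log-log space. The function name `fit_scaling_exponent`, the widths, and the prefactors are hypothetical, and the data is synthetic, generated from the two reported exponents rather than measured; the paper's actual fitting procedure is not described in this summary.

```python
import numpy as np

def fit_scaling_exponent(widths, ranks):
    """Fit log(rank) = beta * log(width) + log(c) by least squares.

    Returns (beta, c). Assumes a pure power law, which is an
    illustrative assumption, not the paper's stated procedure.
    """
    log_w = np.log(np.asarray(widths, dtype=float))
    log_r = np.log(np.asarray(ranks, dtype=float))
    beta, log_c = np.polyfit(log_w, log_r, deg=1)
    return float(beta), float(np.exp(log_c))

# Synthetic width sweep (hypothetical values, not from the paper).
widths = np.array([256.0, 512.0, 1024.0, 2048.0, 4096.0])

# Synthetic ranks built from the two reported exponents;
# the prefactors 3.0 and 0.1 are arbitrary.
ranks_weak = 3.0 * widths ** 0.44
ranks_linear = 0.1 * widths ** 1.02

beta_weak, _ = fit_scaling_exponent(widths, ranks_weak)
beta_linear, _ = fit_scaling_exponent(widths, ranks_linear)

# Ratio of exponents: 1.02 / 0.44 is roughly 2.3.
exponent_ratio = beta_linear / beta_weak
print(f"beta_weak={beta_weak:.2f}, beta_linear={beta_linear:.2f}, "
      f"ratio={exponent_ratio:.1f}")
```

Under this reading, the 2.3× figure is the ratio of the two fitted exponents, not a ratio of ranks or of losses at any particular width.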

📝 Abstract

Scaling laws have made language-model performance predictable from model size, data, and compute, but they typically treat the optimizer as a fixed training detail. We show that this assumption misses a fundamental axis of representation scaling: how effectively the optimizer converts added FFN width into utilized spectral capacity. Using eigenspectra of feed-forward network representations, measured through soft and hard spectral ranks, we find that the same Transformer architecture realizes markedly different spectral scaling laws when trained with different optimizers. Holding architecture and width schedule fixed, AdamW exhibits weak hard-rank scaling (β=0.44) on rare-token (TAIL) representations where learning is known to be hardest, whereas Muon achieves linear scaling (β=1.02) in the same regime, a 2.3× increase in the scaling exponent. This difference is not reducible to validation loss: under extended training, AdamW configurations can match low-rank Dion variants in perplexity while exhibiting sharply different spectral geometry, demonstrating that matched loss does not imply matched representation structure. Hard-soft rank asymmetry further reveals that optimizers differ not only in how much capacity is realized, but also in how that capacity is structured across eigenmodes. To disentangle optimizer effects from architectural ones, we compare against architectural interventions (e.g., attention rank and positional encoding), and find that optimizer-induced spectral shifts often exceed the architectural effects. These results suggest optimization as a first-class axis of representation scaling, motivating optimizer-architecture co-design.
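The abstract does not define soft and hard spectral rank, so the following is only a minimal sketch under assumed definitions that are common in the spectral-analysis literature: soft rank as the exponential of the Shannon entropy of the normalized eigenvalues, and hard rank as the participation ratio. The function names and the use of a feature covariance matrix are assumptions for illustration and may differ from the paper's exact metrics.

```python
import numpy as np

def eigenspectrum(features):
    """Eigenvalues of the covariance of features with shape (n_tokens, width)."""
    centered = features - features.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (centered.shape[0] - 1)
    eig = np.linalg.eigvalsh(cov)   # symmetric matrix, real eigenvalues
    return np.clip(eig, 0.0, None)  # remove tiny negative round-off

def soft_rank(eig):
    """Assumed definition: exp of the Shannon entropy of normalized eigenvalues."""
    p = eig / eig.sum()
    p = p[p > 0]                    # treat 0 * log(0) as 0
    return float(np.exp(-(p * np.log(p)).sum()))

def hard_rank(eig):
    """Assumed definition: participation ratio (sum eig)^2 / sum(eig^2)."""
    return float(eig.sum() ** 2 / (eig ** 2).sum())

# Synthetic features whose variance decays across dimensions
# (hypothetical data, not from the paper).
rng = np.random.default_rng(0)
scales = 1.0 / np.arange(1, 65)
features = rng.normal(size=(4096, 64)) * scales

eig = eigenspectrum(features)
print(f"soft rank = {soft_rank(eig):.1f}, hard rank = {hard_rank(eig):.1f}")
```

Under these assumed definitions both metrics equal the width for a flat spectrum and equal 1 when a single eigenmode carries all the variance. The participation ratio is never larger than the entropy-based rank and is more sensitive to the dominant eigenmodes, which is one way a hard-soft asymmetry can indicate how capacity is distributed across eigenmodes.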

Problem

Research questions and friction points this paper is trying to address.

scaling laws

optimizer

spectral capacity

representation structure

Transformer architecture

Innovation

Methods, ideas, or system contributions that make the work stand out.

spectral scaling laws

optimizer-induced representation

hard/soft spectral rank