Sharp Capacity Scaling of Spectral Optimizers in Learning Associative Memory

📅 2026-03-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates the mechanisms and limits behind the superior performance of spectral optimizers such as Muon over conventional optimizers like SGD in associative memory learning, focusing on storage capacity and retrieval dynamics. Using a linear associative memory model with non-orthogonal Gaussian inputs and logistic loss, the study combines a thresholded gradient approximation with spectral optimization theory to characterize, for the first time, Muon's enhanced storage capacity, its critical batch size, and its faster initial recovery rate. The analysis shows that Muon substantially increases memory capacity and accommodates larger batch sizes, while both optimizers ultimately converge to the same information-theoretic limit. Synthetic experiments validate the derived scaling laws, providing a theoretical foundation for the signal amplification mechanism inherent in spectral optimizers.
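
The sketch below illustrates the contrast the summary describes: one plain SGD step versus one Muon-style step (gradient orthogonalization, here via an exact SVD rather than the Newton-Schulz iteration used in practice) on a linear associative memory trained with logistic loss on Gaussian keys. This is not the paper's code; the dimensions, learning rate, and batch construction are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's implementation): one SGD step vs. one
# Muon-style step on a linear associative memory W under logistic (softmax) loss.
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 512          # embedding dimension, number of stored associations (assumed sizes)
lr = 0.1                # illustrative learning rate

keys = rng.normal(size=(n, d)) / np.sqrt(d)   # non-orthogonal Gaussian input embeddings
targets = rng.integers(0, n, size=n)          # each key is associated with one output class

def logistic_grad(W, batch):
    """Gradient of the softmax/logistic loss for a batch of association indices."""
    X = keys[batch]                            # (b, d)
    logits = X @ W.T                           # (b, n): scores against every stored output
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    p[np.arange(len(batch)), targets[batch]] -= 1.0
    return p.T @ X / len(batch)                # (n, d) gradient w.r.t. W

def sgd_step(W, G):
    return W - lr * G                          # plain gradient step

def muon_step(W, G):
    # Muon-style update: replace the gradient by its nearest semi-orthogonal matrix
    # (all singular values set to 1), amplifying weak signal directions relative to SGD.
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return W - lr * (U @ Vt)

batch = rng.integers(0, n, size=128)           # illustrative batch
W0 = np.zeros((n, d))
G = logistic_grad(W0, batch)
W_sgd, W_muon = sgd_step(W0, G), muon_step(W0, G)
```

The spectral step treats all gradient directions equally, which is one way to read the "signal amplification" the summary refers to; rarely seen associations receive updates of the same scale as frequent ones.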
📝 Abstract
Spectral optimizers such as Muon have recently shown strong empirical performance in large-scale language model training, but the source and extent of their advantage remain poorly understood. We study this question through the linear associative memory problem, a tractable model for factual recall in transformer-based models. In particular, we go beyond orthogonal embeddings and consider Gaussian inputs and outputs, which allows the number of stored associations to greatly exceed the embedding dimension. Our main result sharply characterizes the recovery rates of one step of Muon and SGD on the logistic regression loss under a power law frequency distribution. We show that the storage capacity of Muon significantly exceeds that of SGD, and moreover Muon saturates at a larger critical batch size. We further analyze the multi-step dynamics under a thresholded gradient approximation and show that Muon achieves a substantially faster initial recovery rate than SGD, while both methods eventually converge to the information-theoretic limit at comparable speeds. Experiments on synthetic tasks validate the predicted scaling laws. Our analysis provides a quantitative understanding of the signal amplification of Muon and lays the groundwork for establishing scaling laws across more practical language modeling tasks and optimizers.
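
As a companion to the abstract's setup, the short sketch below shows one way to instantiate the power-law frequency distribution over stored associations and a simple recall metric (fraction of keys whose argmax prediction matches the stored target). The exponent, sizes, and sampling scheme are assumptions for illustration, not values from the paper.

```python
# Illustrative sketch, assuming a Zipf-like frequency distribution over n associations
# and argmax retrieval as the recall criterion.
import numpy as np

rng = np.random.default_rng(1)
n, d, alpha = 512, 64, 1.0                 # assumed sizes and power-law exponent

freq = np.arange(1, n + 1) ** -alpha
freq /= freq.sum()                         # P(association i) proportional to i^{-alpha}

keys = rng.normal(size=(n, d)) / np.sqrt(d)
targets = rng.permutation(n)               # each key mapped to a distinct output index

def recall_fraction(W):
    """Fraction of associations retrieved correctly by argmax over the logits."""
    preds = (keys @ W.T).argmax(axis=1)
    return float((preds == targets).mean())

# A batch drawn from the power-law frequency distribution, as in the one-step analysis.
batch = rng.choice(n, size=256, p=freq)
print(recall_fraction(np.zeros((n, d))))   # chance-level recall before any update
```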
Problem

Research questions and friction points this paper is trying to address.

spectral optimizers
associative memory
storage capacity
scaling laws
logistic regression
Innovation

Methods, ideas, or system contributions that make the work stand out.

spectral optimizers
associative memory
capacity scaling
Muon
logistic regression