Sharp Capacity Scaling of Spectral Optimizers in Learning Associative Memory

📅 2026-03-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates the mechanisms and limits behind the superior performance of spectral optimizers such as Muon over conventional optimizers like SGD in associative memory learning, focusing on storage capacity and retrieval dynamics. Using a linear associative memory model with non-orthogonal Gaussian inputs and logistic loss, the study combines a thresholded gradient approximation with spectral optimization theory to characterize, for the first time, Muon's enhanced storage capacity, its critical batch size, and its faster initial recovery rate. The analysis shows that Muon substantially increases memory capacity and accommodates larger batch sizes, while both optimizers ultimately converge to the same information-theoretic limit. Synthetic experiments validate the derived scaling laws, providing a theoretical foundation for the signal amplification mechanism inherent in spectral optimizers.
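
The sketch below illustrates the contrast the summary describes: one plain SGD step versus one Muon-style step (gradient orthogonalization, here via an exact SVD rather than the Newton-Schulz iteration used in practice) on a linear associative memory trained with logistic loss on Gaussian keys. This is not the paper's code; the dimensions, learning rate, and batch construction are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's implementation): one SGD step vs. one
# Muon-style step on a linear associative memory W under logistic (softmax) loss.
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 512          # embedding dimension, number of stored associations (assumed sizes)
lr = 0.1                # illustrative learning rate

keys = rng.normal(size=(n, d)) / np.sqrt(d)   # non-orthogonal Gaussian input embeddings
targets = rng.integers(0, n, size=n)          # each key is associated with one output class

def logistic_grad(W, batch):
    """Gradient of the softmax/logistic loss for a batch of association indices."""
    X = keys[batch]                            # (b, d)
    logits = X @ W.T                           # (b, n): scores against every stored output
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    p[np.arange(len(batch)), targets[batch]] -= 1.0
    return p.T @ X / len(batch)                # (n, d) gradient w.r.t. W

def sgd_step(W, G):
    return W - lr * G                          # plain gradient step

def muon_step(W, G):
    # Muon-style update: replace the gradient by its nearest semi-orthogonal matrix
    # (all singular values set to 1), amplifying weak signal directions relative to SGD.
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return W - lr * (U @ Vt)

batch = rng.integers(0, n, size=128)           # illustrative batch
W0 = np.zeros((n, d))
G = logistic_grad(W0, batch)
W_sgd, W_muon = sgd_step(W0, G), muon_step(W0, G)
```

The spectral step treats all gradient directions equally, which is one way to read the "signal amplification" the summary refers to; rarely seen associations receive updates of the same scale as frequent ones.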
📝 Abstract
Spectral optimizers such as Muon have recently shown strong empirical performance in large-scale language model training, but the source and extent of their advantage remain poorly understood. We study this question through the linear associative memory problem, a tractable model for factual recall in transformer-based models. In particular, we go beyond orthogonal embeddings and consider Gaussian inputs and outputs, which allows the number of stored associations to greatly exceed the embedding dimension. Our main result sharply characterizes the recovery rates of one step of Muon and SGD on the logistic regression loss under a power law frequency distribution. We show that the storage capacity of Muon significantly exceeds that of SGD, and moreover Muon saturates at a larger critical batch size. We further analyze the multi-step dynamics under a thresholded gradient approximation and show that Muon achieves a substantially faster initial recovery rate than SGD, while both methods eventually converge to the information-theoretic limit at comparable speeds. Experiments on synthetic tasks validate the predicted scaling laws. Our analysis provides a quantitative understanding of the signal amplification of Muon and lays the groundwork for establishing scaling laws across more practical language modeling tasks and optimizers.
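
As a companion to the abstract's setup, the short sketch below shows one way to instantiate the power-law frequency distribution over stored associations and a simple recall metric (fraction of keys whose argmax prediction matches the stored target). The exponent, sizes, and sampling scheme are assumptions for illustration, not values from the paper.

```python
# Illustrative sketch, assuming a Zipf-like frequency distribution over n associations
# and argmax retrieval as the recall criterion.
import numpy as np

rng = np.random.default_rng(1)
n, d, alpha = 512, 64, 1.0                 # assumed sizes and power-law exponent

freq = np.arange(1, n + 1) ** -alpha
freq /= freq.sum()                         # P(association i) proportional to i^{-alpha}

keys = rng.normal(size=(n, d)) / np.sqrt(d)
targets = rng.permutation(n)               # each key mapped to a distinct output index

def recall_fraction(W):
    """Fraction of associations retrieved correctly by argmax over the logits."""
    preds = (keys @ W.T).argmax(axis=1)
    return float((preds == targets).mean())

# A batch drawn from the power-law frequency distribution, as in the one-step analysis.
batch = rng.choice(n, size=256, p=freq)
print(recall_fraction(np.zeros((n, d))))   # chance-level recall before any update
```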
Problem

Research questions and friction points this paper is trying to address.

spectral optimizers
associative memory
storage capacity
scaling laws
logistic regression
Innovation

Methods, ideas, or system contributions that make the work stand out.

spectral optimizers
associative memory
capacity scaling
Muon
logistic regression