Transformers as Measure-Theoretic Associative Memory: A Statistical Perspective and Minimax Optimality

📅 2026-02-02
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work provides a theoretical explanation for the content-addressable memory capability of Transformers in long-context settings. By modeling the context as a probability measure and interpreting the attention mechanism as an integral operator acting on measures, the authors propose a “recall-predict” decomposition framework and, for the first time, formally characterize the associative memory mechanism of Transformers within a measure-theoretic framework. Under spectral assumptions, they prove that shallow Transformers combined with MLPs can learn the recall-predict mapping via empirical risk minimization at a minimax-optimal rate. Matching lower bounds are established, confirming the tightness of the analysis and demonstrating provable generalization guarantees for the proposed paradigm.

๐Ÿ“ Abstract
Transformers excel through content-addressable retrieval and the ability to exploit contexts of, in principle, unbounded length. We recast associative memory at the level of probability measures, treating a context as a distribution over tokens and viewing attention as an integral operator on measures. Concretely, for mixture contexts $\nu = I^{-1} \sum_{i=1}^{I} \mu^{(i)}$ and a query $x_{\mathrm{q}}$ tied to a component $i^*$, the task decomposes into (i) recall of the relevant component $\mu^{(i^*)}$ and (ii) prediction from $(\mu^{(i^*)}, x_{\mathrm{q}})$. We study learned softmax attention (not a frozen kernel) trained by empirical risk minimization and show that a shallow measure-theoretic Transformer composed with an MLP learns the recall-and-predict map under a spectral assumption on the input densities. We further establish a matching minimax lower bound with the same rate exponent (up to multiplicative constants), proving sharpness of the convergence order. The framework offers a principled recipe for designing and analyzing Transformers that recall from arbitrarily long, distributional contexts with provable generalization guarantees.
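The recall step of this decomposition can be illustrated numerically: treat the context as an empirical measure over tokens drawn from a mixture, and apply softmax attention as integration of the value map against the softmax-reweighted measure. This is a minimal sketch, not the paper's construction; the Gaussian token noise, the scaled one-hot component means, and the inverse temperature `beta` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_per, I = 4, 50, 3

# Mixture context nu = I^{-1} sum_i mu^{(i)}: I well-separated components
# (scaled one-hot means, an illustrative choice), tokens concentrated
# around each component mean.
means = 3.0 * np.eye(I, d)
tokens = np.concatenate([m + 0.1 * rng.normal(size=(n_per, d)) for m in means])

def attention(x_q, tokens, beta=5.0):
    # Softmax attention as an integral operator on the empirical measure:
    # attn(x_q) = sum_y softmax(beta * <x_q, y>) * v(y), with v(y) = y here.
    logits = beta * tokens @ x_q
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ tokens

# Query tied to component i*: attention "recalls" mu^{(i*)}, concentrating
# its output near that component's mean (the "recall" half of the
# recall-predict decomposition; a downstream MLP would handle "predict").
i_star = 1
x_q = means[i_star] + 0.05 * rng.normal(size=d)
out = attention(x_q, tokens)

print(np.linalg.norm(means - out, axis=1).argmin())  # -> 1
```

With well-separated components, the softmax weights concentrate almost entirely on tokens from the queried component, so the output approximates the mean of $\mu^{(i^*)}$.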
Problem

Research questions and friction points this paper is trying to address.

associative memory
transformers
measure theory
minimax optimality
distributional contexts
Innovation

Methods, ideas, or system contributions that make the work stand out.

measure-theoretic Transformer
associative memory
minimax optimality
softmax attention
distributional context