🤖 AI Summary
This work investigates the parameter efficiency limits of Transformers on sequence memorization tasks, characterizing their optimal memorization capacity under both next-token prediction and sequence-to-sequence settings. Using hard-attention modeling, information-theoretic tools, and lower-bound proofs on parametric complexity, the authors establish upper and lower bounds that are tight up to logarithmic factors: memorizing *N* sequences requires only Õ(√*N*) parameters in the next-token setting, and Õ(√(*nN*)) in the seq2seq setting (where *n* is the sequence length). Crucially, they show that self-attention efficiently identifies input patterns, whereas the feed-forward network (FFN) forms a bottleneck for label mapping, fundamentally limiting overall memorization efficiency. These results provide the first theoretically grounded, quantitative characterization of Transformer memorization capacity, offering principled insight into architectural design trade-offs and a formal benchmark for understanding memorization mechanisms in attention-based models.
📝 Abstract
Recent research in the field of machine learning has increasingly focused on the memorization capacity of Transformers, but how efficient they are is not yet well understood. We demonstrate that Transformers can memorize labels with $\tilde{O}(\sqrt{N})$ parameters in a next-token prediction setting for $N$ input sequences of length $n$, which is proved to be optimal up to logarithmic factors. This indicates that Transformers can efficiently perform memorization with little influence from the input length $n$, owing to the benefit of parameter sharing. We also analyze the memorization capacity in the sequence-to-sequence setting, and find that $\tilde{O}(\sqrt{nN})$ parameters are not only sufficient, but also necessary, at least for Transformers with hardmax. These results suggest that while self-attention mechanisms can efficiently identify input sequences, the feed-forward network becomes a bottleneck when associating a label with each token.
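To make the contrast between the two bounds concrete, here is a minimal arithmetic sketch of how the claimed parameter counts scale, with logarithmic factors and constants dropped entirely (the function names are illustrative, not from the paper):

```python
import math

def next_token_params(N: int) -> int:
    # Sketch of the next-token bound O~(sqrt(N)): the count depends
    # only on the number of sequences N, not on their length n.
    return math.isqrt(N)

def seq2seq_params(n: int, N: int) -> int:
    # Sketch of the seq2seq bound O~(sqrt(n*N)): labeling every token
    # adds an extra sqrt(n) factor.
    return math.isqrt(n * N)

if __name__ == "__main__":
    N, n = 10**6, 100
    print(next_token_params(N))   # unaffected by the sequence length n
    print(seq2seq_params(n, N))   # grows by a factor of sqrt(n) over the above
```

For $N = 10^6$ sequences of length $n = 100$, the sketch gives $10^3$ versus $10^4$: the seq2seq requirement is $\sqrt{n} = 10$ times larger, matching the intuition that per-token labeling, not sequence identification, drives the cost.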