🤖 AI Summary
This work challenges the common view that weight matrices in transformer language models act as associative key-value memories, whose parameter counts must scale linearly with the number of facts, and proposes a geometric memorization mechanism instead. In this framework, subject embeddings encode linear superpositions of their associated attribute vectors, while a small MLP with ReLU gating performs relation-conditioned attribute selection. Through theoretical constructions, a matching information-theoretic lower bound, and controlled experiments, the study shows that a logarithmic embedding dimension suffices for a single-layer transformer to memorize random bijections from subjects to a shared attribute set. It also extends the analysis to multi-hop queries, with and without chain-of-thought, where it proves a capacity-depth tradeoff. Empirically, gradient descent discovers solutions with the predicted geometric structure, and the trained MLP transfers zero-shot to entirely new bijections once subject embeddings are appropriately re-initialized, indicating that it has learned a generic selection mechanism rather than a particular set of facts.
📝 Abstract
How do transformer language models memorize factual associations? A common view casts internal weight matrices as associative memories over pairs of embeddings, requiring parameter counts that scale linearly with the number of facts. We develop a theoretical and empirical account of an alternative, \emph{geometric} form of memorization in which learned embeddings encode relational structure directly, and the MLP plays a qualitatively different role. In a controlled setting where a single-layer transformer must memorize random bijections from subjects to a shared attribute set, we prove that a logarithmic embedding dimension suffices: subject embeddings encode \emph{linear superpositions} of their associated attribute vectors, and a small MLP acts as a relation-conditioned selector that extracts the relevant attribute via ReLU gating, and not as an associative key-value mapping. We extend these results to the multi-hop setting -- chains of relational queries such as ``Who is the mother of the wife of $x$?'' -- providing constructions with and without chain-of-thought that exhibit a provable capacity-depth tradeoff, complemented by a matching information-theoretic lower bound. Empirically, gradient descent discovers solutions with precisely the predicted structure. Once trained, the MLP transfers zero-shot to entirely new bijections when subject embeddings are appropriately re-initialized, revealing that it has learned a generic selection mechanism rather than memorized any particular set of facts.
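The selection mechanism described in the abstract can be illustrated with a small NumPy sketch. This is a hand-built toy under simplifying assumptions, not the paper's construction and not a trained model: each subject embedding is the sum of its attribute vectors placed in disjoint per-relation blocks (a deliberately simple form of superposition), and the "MLP" is a fixed pair of ReLU units per coordinate whose bias is shifted by a one-hot relation gate. All names and sizes (`n`, `R`, `d`, `B`, `embed_subjects`, `select`, `decode`) are hypothetical, and the sketch makes no attempt to reproduce the logarithmic-dimension bound.

```python
import numpy as np

rng = np.random.default_rng(0)
n, R, d = 64, 3, 32   # subjects, relations, attribute dimension (illustrative sizes)
B = 2.0               # gate bias; assumed larger than any |coordinate| (here 1)

# Shared attribute vectors with random +/-1 entries.
A = rng.choice([-1.0, 1.0], size=(n, d))
assert len(np.unique(A, axis=0)) == n  # decoding below needs distinct rows


def embed_subjects(bijections):
    """Row s is the sum over r of a_{f_r(s)} placed in block r (block-wise superposition)."""
    return np.concatenate([A[f] for f in bijections], axis=1)  # shape (n, R*d)


def select(E, r):
    """ReLU-gated selector: passes block r through unchanged and zeroes the others."""
    gate = np.zeros(R)
    gate[r] = 1.0
    shift = B * (np.repeat(gate, d) - 1.0)  # 0 on the chosen block, -B elsewhere
    # relu(x + shift) - relu(-x + shift) equals x when shift = 0, and 0 when shift = -B.
    hidden = np.maximum(E + shift, 0.0) - np.maximum(-E + shift, 0.0)
    return hidden.reshape(len(E), R, d).sum(axis=1)  # shape (n, d)


def decode(V):
    """Map each selected vector to the index of the best-matching attribute."""
    return np.argmax(V @ A.T, axis=1)


def recall_accuracy(bijections):
    E = embed_subjects(bijections)
    return float(np.mean([np.mean(decode(select(E, r)) == f)
                          for r, f in enumerate(bijections)]))


facts = [rng.permutation(n) for _ in range(R)]      # one random bijection per relation
new_facts = [rng.permutation(n) for _ in range(R)]  # unseen bijections, same selector

print(recall_accuracy(facts), recall_accuracy(new_facts))
```

In this toy the selector never sees which facts are stored: `select` and `decode` are identical for `facts` and `new_facts`, and only the subject embeddings are rebuilt. That mirrors, in a much weaker hand-constructed form, the abstract's observation that the MLP acts as a generic selector while the facts themselves live in the embeddings.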