Geometric Factual Recall in Transformers

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work challenges the conventional view of weight matrices in transformer language models as associative key-value memories and proposes a geometric memory mechanism instead. In this framework, subject embeddings encode linear superpositions of their attribute vectors, while a small MLP with ReLU gating performs relation-conditioned attribute selection. For a single-layer transformer that must memorize random bijections from subjects to a shared attribute set, the study proves that a logarithmic embedding dimension suffices. It extends the constructions to multi-hop queries, with and without chain-of-thought, where they exhibit a capacity-depth tradeoff complemented by a matching information-theoretic lower bound. In controlled experiments, gradient descent finds solutions with the predicted geometric structure, and the trained MLP transfers zero-shot to entirely new bijections once subject embeddings are appropriately re-initialized, indicating that it learns a generic selection mechanism rather than a particular set of facts.

📝 Abstract

How do transformer language models memorize factual associations? A common view casts internal weight matrices as associative memories over pairs of embeddings, requiring parameter counts that scale linearly with the number of facts. We develop a theoretical and empirical account of an alternative, geometric form of memorization in which learned embeddings encode relational structure directly, and the MLP plays a qualitatively different role. In a controlled setting where a single-layer transformer must memorize random bijections from subjects to a shared attribute set, we prove that a logarithmic embedding dimension suffices: subject embeddings encode linear superpositions of their associated attribute vectors, and a small MLP acts as a relation-conditioned selector that extracts the relevant attribute via ReLU gating, and not as an associative key-value mapping. We extend these results to the multi-hop setting (chains of relational queries such as "Who is the mother of the wife of x?"), providing constructions with and without chain-of-thought that exhibit a provable capacity-depth tradeoff, complemented by a matching information-theoretic lower bound. Empirically, gradient descent discovers solutions with precisely the predicted structure. Once trained, the MLP transfers zero-shot to entirely new bijections when subject embeddings are appropriately re-initialized, revealing that it has learned a generic selection mechanism rather than memorized any particular set of facts.
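
The mechanism described in the abstract can be illustrated with a small hand-built toy. This is a sketch under simplifying assumptions, not the paper's construction: attribute vectors are taken to be ±1 binary codes of length ceil(log2 N), the superposition places each relation's attribute vector in its own coordinate block, and the selector weights are written by hand instead of being learned. All function names are hypothetical. The point it shows is that the ReLU-gated selector depends only on the relation, so the same selector keeps working when the facts change and only the subject embeddings are rebuilt.

```python
import random


def attribute_codes(n):
    """n distinct +/-1 codes of length ceil(log2 n): the binary expansion of each index."""
    d = max(1, (n - 1).bit_length())
    return [[1.0 if (j >> b) & 1 else -1.0 for b in range(d)] for j in range(n)]


def subject_embeddings(bijections, codes):
    """e_s = sum over relations r of the code of pi_r(s), placed in block r.

    Assumption of this sketch: each relation gets a disjoint coordinate block,
    so the sum is simply a concatenation.
    """
    n = len(codes)
    return [[x for pi in bijections for x in codes[pi[s]]] for s in range(n)]


def relu(x):
    return max(0.0, x)


def selector_mlp(e, r, num_relations, d, big=2.0):
    """ReLU-gated selection of block r from embedding e.

    The bias `big` exceeds every |coordinate| (all are +/-1), so non-selected
    blocks are zeroed by the ReLU; the selected block passes through as
    relu(x) - relu(-x) = x. Nothing here depends on which facts are stored.
    """
    out = [0.0] * d
    for k in range(num_relations):
        gate = 0.0 if k == r else big
        for i in range(d):
            x = e[k * d + i]
            out[i] += relu(x - gate) - relu(-x - gate)
    return out


def recall(embeddings, codes, s, r, num_relations):
    """Answer the query (subject s, relation r) by decoding to the best-matching attribute."""
    d = len(codes[0])
    out = selector_mlp(embeddings[s], r, num_relations, d)
    scores = [sum(o * c for o, c in zip(out, code)) for code in codes]
    return max(range(len(codes)), key=scores.__getitem__)


def all_recalled(embeddings, codes, bijections):
    num_relations = len(bijections)
    return all(
        recall(embeddings, codes, s, r, num_relations) == pi[s]
        for r, pi in enumerate(bijections)
        for s in range(len(codes))
    )


rng = random.Random(0)
n, num_relations = 64, 3
codes = attribute_codes(n)

# Original facts: one random bijection per relation.
bijections = [rng.sample(range(n), n) for _ in range(num_relations)]
emb = subject_embeddings(bijections, codes)
all_correct = all_recalled(emb, codes, bijections)

# New facts: only the subject embeddings are rebuilt; the selector is unchanged.
new_bijections = [rng.sample(range(n), n) for _ in range(num_relations)]
new_emb = subject_embeddings(new_bijections, codes)
transfer_correct = all_recalled(new_emb, codes, new_bijections)

print("embedding dimension:", len(emb[0]))
print("original facts recalled:", all_correct)
print("new facts recalled with the same selector:", transfer_correct)
```

In this toy the embedding dimension is the number of relations times ceil(log2 N), and decoding is exact because distinct ±1 codes have an inner product strictly below the code length. A learned model would not need to use disjoint blocks or binary codes; those choices are made here only to keep the example verifiable by hand.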

Problem

Research questions and friction points this paper is trying to address.

geometric memorization

factual recall

transformer

relational reasoning

embedding

Innovation

Methods, ideas, or system contributions that make the work stand out.

geometric memorization

linear superposition

relation-conditioned selection