Constructing Efficient Fact-Storing MLPs for Transformers

📅 2025-11-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limited factual-knowledge storage efficiency, parameter redundancy, and editability of MLP modules in Transformers. Methodologically: (1) it introduces a metric on value embeddings that quantifies facts-per-parameter storage efficiency; (2) it identifies a simple encoder–decoder mechanism whose facts-per-parameter scaling empirically matches gradient-descent-trained MLPs, and theoretically characterizes the fundamental trade-off between fact-storage capacity and model usability; (3) it enables modular MLP replacement via explicit weight construction grounded in information-theoretic bounds. Contributions include: (i) asymptotically optimal parameter efficiency matching information-theoretic bounds for some embeddings; (ii) support for editable, layer-wise factual updates and precise recall within a single Transformer layer; and (iii) applicability to all but a measure-zero set of feasible input–output pairs—while preserving model usability for factual recall.
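The key-value view of MLP fact storage that the summary builds on can be illustrated with a toy explicit weight construction. This is a minimal sketch, not the paper's construction: the dimensions, the random unit-norm embeddings, the ReLU threshold, and the numpy implementation are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_facts = 256, 1000  # embedding dimension, number of (key, value) facts

# Random unit-norm key and value embeddings (stand-ins for learned token/entity embeddings).
unit = lambda M: M / np.linalg.norm(M, axis=1, keepdims=True)
K = unit(rng.standard_normal((n_facts, d)))
V = unit(rng.standard_normal((n_facts, d)))

# Explicit weight construction for a one-hidden-layer ReLU MLP:
#   hidden_i = ReLU(<k_i, x> - b)    fires only when x matches key i (keys are near-orthogonal),
#   output   = sum_i hidden_i * v_i  recalls the paired value embedding.
b = 0.5  # threshold between self-similarity (~1) and cross-similarity (~1/sqrt(d))
W_in, W_out = K, V.T

def fact_mlp(x):
    h = np.maximum(W_in @ x - b, 0.0)  # ReLU gate selects the matching fact
    return W_out @ h

# Querying with a stored key should return (a scaled copy of) its value embedding.
i = 42
y = fact_mlp(K[i])
print(int(np.argmax(V @ y)))  # -> 42 with high probability
```

This layout deliberately stores one fact per hidden neuron; the constructions studied in the paper aim for far denser packings that approach information-theoretic facts-per-parameter bounds.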

📝 Abstract
The success of large language models (LLMs) can be attributed in part to their ability to efficiently store factual knowledge as key-value mappings within their MLP parameters. Recent work has proposed explicit weight constructions to build such fact-storing MLPs, providing an improved understanding of LLM fact storage mechanisms. In this paper, we introduce an MLP construction framework that improves over previous constructions in three areas: it 1) works for all but a measure-zero set of feasible input-output pairs, 2) achieves asymptotically optimal parameter efficiency matching information-theoretic bounds for some embeddings, and 3) maintains usability within Transformers for factual recall. Through our improvements, we 1) discover a metric on value embeddings that characterizes facts-per-parameter scaling for both constructed and gradient-descent-trained MLPs, 2) identify a simple encoder-decoder mechanism that empirically matches gradient-descent MLP facts-per-parameter asymptotics across all the inputs and outputs we test, and 3) uncover a fundamental tradeoff between an MLP's fact-storage capacity and its usability within Transformers. Finally, we demonstrate a proof-of-concept application of fact-storing MLPs: modular fact editing on one-layer Transformers by replacing entire MLPs at once.
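The proof-of-concept mentioned at the end of the abstract, editing facts by swapping out an entire MLP, can be mocked up at a very high level as below. This is a sketch under assumed interfaces, not the paper's implementation: the FactMLP and OneLayerTransformer classes, the elided attention block, and the bias value are hypothetical.

```python
import numpy as np

class FactMLP:
    """One-hidden-layer ReLU MLP built explicitly from (key, value) embedding pairs."""
    def __init__(self, keys, values, bias=0.5):
        self.W_in, self.W_out, self.b = keys, values.T, bias

    def __call__(self, x):
        return self.W_out @ np.maximum(self.W_in @ x - self.b, 0.0)

class OneLayerTransformer:
    """Toy stand-in for a one-layer Transformer; attention is elided since only the MLP matters here."""
    def __init__(self, mlp):
        self.mlp = mlp

    def forward(self, x):
        return x + self.mlp(x)  # residual stream plus MLP output

rng = np.random.default_rng(0)
d, n = 256, 500
unit = lambda M: M / np.linalg.norm(M, axis=1, keepdims=True)
keys = unit(rng.standard_normal((n, d)))
values_old = unit(rng.standard_normal((n, d)))
values_new = unit(rng.standard_normal((n, d)))

model = OneLayerTransformer(FactMLP(keys, values_old))
# Modular fact editing: rebuild the MLP from the updated fact table and replace the
# whole module at once, rather than fine-tuning or locating individual weights.
model.mlp = FactMLP(keys, values_new)
```

The point being illustrated is that, because the fact-storing weights are constructed explicitly rather than trained, a layer-wise factual update reduces to re-running the construction and swapping in the new module.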
Problem

Research questions and friction points this paper is trying to address.

How to construct MLPs that store factual knowledge efficiently within Transformers
How to achieve near-optimal parameter efficiency across a broad range of input-output pairs
What trade-offs exist between an MLP's fact-storage capacity and its usability within Transformers
Innovation

Methods, ideas, or system contributions that make the work stand out.

MLP construction works for all but a measure-zero set of feasible input-output pairs
Achieves asymptotically optimal parameter efficiency matching information-theoretic bounds
Maintains usability within Transformers for factual recall
Owen Dugan
Stanford CS PhD Candidate
Roberto Garcia
Institute for Computational & Mathematical Engineering, Stanford University
Ronny Junkins
Computer Science Department, Stanford University
Jerry Liu
Institute for Computational & Mathematical Engineering, Stanford University
Dylan Zinsley
Computer Science Department, University of Wisconsin–Madison
Sabri Eyuboglu
PhD Student in Computer Science, Stanford University
Machine learning
Atri Rudra
Katherine Johnson Chair in AI, Professor, CSE, University at Buffalo
Structured Linear Algebra, Society and Computing, Coding Theory, Database algorithms
Chris Ré
Computer Science Department, Stanford University