MLP Memory: Language Modeling with Retriever-pretrained External Memory

📅 2025-08-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the hallucination-prone behavior of large language models (LLMs) in knowledge-intensive tasks, this paper proposes a differentiable external memory augmentation architecture: a retriever’s functionality is distilled into a pretrained MLP module, enabling end-to-end co-optimization with the Transformer decoder and effectively decoupling memory storage from logical reasoning. The design integrates the dynamic knowledge updating capability of retrieval-augmented generation (RAG) with the training flexibility of purely neural models. On WikiText-103 and Web datasets, perplexity improves by 17.5% and 24.1%, respectively; inference speed reaches 80× that of kNN-LM. The approach also achieves significant gains on multiple hallucination detection and factual recall benchmarks. Its core contribution lies in the first realization of a parameterized, trainable, and highly efficient joint retriever–decoder modeling framework.

📝 Abstract
While modern decoder-only LLMs achieve superior performance across various domains, hallucinations have risen to be a common problem in their generated text, hindering their application in knowledge-intensive tasks. Retrieval-augmented generation (RAG) offers a solution, but the non-parametric nature of the retriever hinders its deep interaction with the LLM. In this work, we propose to decouple memorization from the LLM decoder using a pretrained, differentiable external memory. The external memory is an MLP pretrained by imitating the behavior of a retriever on the entire pretraining dataset. Our resulting architecture, which comprises a transformer decoder and an external MLP memory pretrained on language modeling and retriever imitation, respectively, demonstrates strong perplexity and performance on downstream tasks. Experiments show our architecture exhibits steeper power-law scaling with model size, achieving 17.5% and 24.1% improvement on WikiText-103 and Web datasets compared to decoder-only models while benefiting from added training without overfitting. We demonstrate superior performance on three hallucination benchmarks and nine memory-intensive tasks. Additionally, our approach delivers $80\times$ speedup over $k$NN-LM (500M tokens) and $1.3\times$ faster inference than decoder-only models. Unlike $k$NN-LM, which impairs reasoning, our MLP memory improves StrategyQA performance. We will open-source our code and models in the future.
Problem

Research questions and friction points this paper is trying to address.

Reduces hallucinations in LLM-generated text
Enhances deep interaction between retriever and LLM
Improves performance on memory-intensive tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pretrained MLP external memory for decoupled memorization
Retriever imitation pretraining enhances memory interaction
Combines transformer decoder with differentiable external memory
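The combination above can be sketched minimally: the decoder produces a next-token distribution, the pretrained MLP memory produces another, and the two are mixed. The interpolation rule, weight `lam`, and all dimensions below are illustrative assumptions in the spirit of $k$NN-LM-style mixing; the paper's exact fusion mechanism and MLP design may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, HIDDEN = 8, 16  # toy sizes, for illustration only

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

# Stand-in decoder output head: hidden state -> next-token logits.
W_dec = rng.normal(size=(HIDDEN, VOCAB))

# External MLP memory: in the paper it is pretrained to imitate a
# retriever's next-token distribution over the pretraining corpus;
# here it is just a random two-layer ReLU MLP.
W1 = rng.normal(size=(HIDDEN, 32))
W2 = rng.normal(size=(32, VOCAB))

def mlp_memory(h):
    return np.maximum(h @ W1, 0.0) @ W2  # memory logits

def predict(h, lam=0.25):
    # Interpolate decoder and memory distributions. Since the memory is a
    # differentiable MLP rather than a non-parametric index, both paths can
    # be trained jointly end to end (unlike kNN-LM's frozen datastore).
    p_dec = softmax(h @ W_dec)
    p_mem = softmax(mlp_memory(h))
    return (1 - lam) * p_dec + lam * p_mem

h = rng.normal(size=HIDDEN)  # a hidden state from the decoder
p = predict(h)               # mixed next-token distribution
```

With `lam=0` the model reduces to the plain decoder; a single dense MLP forward pass replaces the nearest-neighbor search over the datastore, which is the source of the reported speedup over $k$NN-LM.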
👥 Authors

Rubin Wei (Shanghai Jiao Tong University)
Jiaqi Cao (Shanghai Jiao Tong University)
Jiarui Wang (LUMIA Lab, Shanghai Jiao Tong University)
Jushi Kai (Shanghai Jiao Tong University)
Qipeng Guo (Fudan University)
Bowen Zhou (Shanghai Artificial Intelligence Laboratory; Electronic Engineering, Tsinghua University)
Zhouhan Lin (LUMIA Lab, Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory)