MLP Memory: Language Modeling with Retriever-pretrained External Memory

📅 2025-08-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the hallucination-prone behavior of large language models (LLMs) in knowledge-intensive tasks, this paper proposes a differentiable external memory augmentation architecture: a retriever’s functionality is distilled into a pretrained MLP module, enabling end-to-end co-optimization with the Transformer decoder and effectively decoupling memory storage from logical reasoning. The design integrates the dynamic knowledge updating capability of retrieval-augmented generation (RAG) with the training flexibility of purely neural models. On WikiText-103 and Web datasets, perplexity improves by 17.5% and 24.1%, respectively; inference speed reaches 80× that of kNN-LM. The approach also achieves significant gains on multiple hallucination detection and factual recall benchmarks. Its core contribution lies in the first realization of a parameterized, trainable, and highly efficient joint retriever–decoder modeling framework.

📝 Abstract
While modern decoder-only LLMs achieve superior performance across various domains, hallucinations have risen to be a common problem in their generated text, hindering their application in knowledge-intensive tasks. Retrieval-augmented generation (RAG) offers a solution, but the non-parametric nature of the retriever hinders its deep interaction with the LLM. In this work, we propose to decouple memorization from the LLM decoder using a pretrained, differentiable external memory. The external memory is an MLP pretrained by imitating the behavior of a retriever on the entire pretraining dataset. Our resulting architecture, which comprises a transformer decoder and an external MLP memory pretrained on language modeling and retriever imitation, respectively, demonstrates strong perplexity and performance on downstream tasks. Experiments show our architecture exhibits steeper power-law scaling with model size, achieving 17.5% and 24.1% improvement on WikiText-103 and Web datasets compared to decoder-only models while benefiting from added training without overfitting. We demonstrate superior performance on three hallucination benchmarks and nine memory-intensive tasks. Additionally, our approach delivers $80\times$ speedup over $k$NN-LM (500M tokens) and $1.3\times$ faster inference than decoder-only models. Unlike $k$NN-LM, which impairs reasoning, our MLP memory improves StrategyQA performance. We will open-source our code and models in the future.
Problem

Research questions and friction points this paper is trying to address.

Reduces hallucinations in LLM-generated text
Enhances deep interaction between retriever and LLM
Improves performance on memory-intensive tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pretrained MLP external memory for decoupled memorization
Retriever imitation pretraining enhances memory interaction
Combines transformer decoder with differentiable external memory
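The combination above can be sketched minimally: the decoder produces a next-token distribution, the pretrained MLP memory produces another, and the two are mixed. The interpolation rule, weight `lam`, and all dimensions below are illustrative assumptions in the spirit of $k$NN-LM-style mixing; the paper's exact fusion mechanism and MLP design may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, HIDDEN = 8, 16  # toy sizes, for illustration only

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

# Stand-in decoder output head: hidden state -> next-token logits.
W_dec = rng.normal(size=(HIDDEN, VOCAB))

# External MLP memory: in the paper it is pretrained to imitate a
# retriever's next-token distribution over the pretraining corpus;
# here it is just a random two-layer ReLU MLP.
W1 = rng.normal(size=(HIDDEN, 32))
W2 = rng.normal(size=(32, VOCAB))

def mlp_memory(h):
    return np.maximum(h @ W1, 0.0) @ W2  # memory logits

def predict(h, lam=0.25):
    # Interpolate decoder and memory distributions. Since the memory is a
    # differentiable MLP rather than a non-parametric index, both paths can
    # be trained jointly end to end (unlike kNN-LM's frozen datastore).
    p_dec = softmax(h @ W_dec)
    p_mem = softmax(mlp_memory(h))
    return (1 - lam) * p_dec + lam * p_mem

h = rng.normal(size=HIDDEN)  # a hidden state from the decoder
p = predict(h)               # mixed next-token distribution
```

With `lam=0` the model reduces to the plain decoder; a single dense MLP forward pass replaces the nearest-neighbor search over the datastore, which is the source of the reported speedup over $k$NN-LM.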
👥 Authors

Rubin Wei (Shanghai Jiao Tong University)
Jiaqi Cao (Shanghai Jiao Tong University)
Jiarui Wang (LUMIA Lab, Shanghai Jiao Tong University)
Jushi Kai (Shanghai Jiao Tong University)
Qipeng Guo (Fudan University)
Bowen Zhou (Shanghai Artificial Intelligence Laboratory; Electronic Engineering, Tsinghua University)
Zhouhan Lin (LUMIA Lab, Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory)