Memory Grafting: Scaling Language Model Pre-training via Offline Conditional Memory

📅 2026-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing conditional memory methods such as Engram learn large memory tables from scratch during pre-training, which makes memory scaling expensive and sometimes ineffective. This work proposes Memory Grafting, which uses a frozen pre-trained grafting model as a reusable constructor of external memory. The grafting model is run offline on frequent local n-grams and its final-token hidden states are stored as memory values. The recipient model retrieves them through exact longest-match suffix lookup, adapts them with lightweight projections and gates, and falls back to a hash-based Engram table for unmatched contexts. Because the grafting model runs only offline and exact lookup has expected O(1) cost with respect to memory-bank size, external memory capacity grows without a matching growth in trainable parameters and with limited training and inference overhead. In the 2.8B-scale setting, the method reaches an average benchmark score of 53.86, compared with 51.95 for MoE and 52.43 for vanilla Engram. In the 0.92B-scale setting, all grafting-model variants improve over the baselines, with Qwen3.5-35B-A3B giving the strongest gains.

📝 Abstract

Scaling conditional memory offers a promising way to increase language-model capacity, but existing methods such as Engram learn large memory tables from scratch during pre-training, making memory scaling expensive and sometimes ineffective. We propose Memory Grafting, a conditional memory scaling method that utilizes frozen hidden states from a grafting model as conditional n-gram memory. Given frequent local n-grams, we run the grafting model offline, store final-token hidden representations as memory values, and let the recipient model retrieve them through exact longest-match suffix lookup. Retrieved memories are adapted by lightweight projections and gates, while a hash-based Engram fallback preserves coverage for unmatched contexts. Since the grafting model is only run offline and exact lookup has expected O(1) complexity with respect to memory-bank size, Memory Grafting expands external latent capacity with limited training and inference overhead. Experiments under matched recipient architectures and pre-training budgets show that Memory Grafting improves over both MoE and vanilla Engram baselines. In the 2.8B-scale setting, it improves the average benchmark score from 51.95 for MoE and 52.43 for vanilla Engram to 53.86. In the 0.92B-scale setting, all grafting-model variants improve over the baselines, with Qwen3.5-35B-A3B giving the strongest gains. These results suggest that pretrained models can serve as reusable constructors of external latent memory, providing a practical step toward scaling future language models beyond trainable parameters alone.
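The retrieval path described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: all function names are hypothetical, `toy_encode` stands in for the frozen grafting model's final-token hidden state, the fallback table is a plain list rather than a learned Engram table, and the scalar sigmoid gate is a simplification of the projections and gates mentioned above.

```python
import math
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Vector = List[float]
Memory = Dict[Tuple[int, ...], Vector]


def toy_encode(ngram: Sequence[int]) -> Vector:
    # Assumption: placeholder for running the frozen grafting model offline
    # and keeping the hidden state at the final token of the n-gram.
    return [float(sum(ngram)), float(len(ngram))]


def build_memory(ngrams: Sequence[Sequence[int]],
                 encode: Callable[[Sequence[int]], Vector]) -> Memory:
    # Offline step: one stored vector per frequent n-gram.
    return {tuple(g): encode(g) for g in ngrams}


def longest_suffix_lookup(context: Sequence[int], memory: Memory,
                          max_n: int) -> Tuple[int, Optional[Vector]]:
    # Try the longest suffix first; each probe is an expected O(1) dict
    # lookup, independent of how many entries the memory bank holds.
    for n in range(min(max_n, len(context)), 0, -1):
        value = memory.get(tuple(context[-n:]))
        if value is not None:
            return n, value
    return 0, None


def hash_fallback(context: Sequence[int], table: List[Vector],
                  n: int = 2) -> Vector:
    # Fallback for unmatched contexts: hash the last n tokens into a
    # fixed-size table (collisions are allowed, coverage is total).
    h = 0
    for token in context[-n:]:
        h = (h * 1000003 + token) % len(table)
    return table[h]


def retrieve(context: Sequence[int], memory: Memory, table: List[Vector],
             max_n: int = 3) -> Tuple[str, int, Vector]:
    n, value = longest_suffix_lookup(context, memory, max_n)
    if value is not None:
        return "exact", n, value
    return "fallback", 0, hash_fallback(context, table)


def gated_add(hidden: Vector, mem: Vector, gate_logit: float) -> Vector:
    # Simplified adaptation: a scalar sigmoid gate scales the retrieved
    # memory before it is added to the recipient's hidden state.
    gate = 1.0 / (1.0 + math.exp(-gate_logit))
    return [h + gate * m for h, m in zip(hidden, mem)]


memory = build_memory([(5, 6), (4, 5, 6), (9,)], toy_encode)
table = [[0.0, 0.0], [0.1, 0.1], [0.2, 0.2], [0.3, 0.3]]
print(retrieve([1, 4, 5, 6], memory, table))  # matches the 3-gram (4, 5, 6)
print(retrieve([7, 5, 6], memory, table))     # backs off to the 2-gram (5, 6)
print(retrieve([1, 2], memory, table)[0])     # no match, uses the hash table
```

Keying the memory on exact token tuples means a lookup never returns a value for a different n-gram, while the hashed table guarantees that every context still receives some memory vector.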

Problem

Research questions and friction points this paper is trying to address.

conditional memory

memory scaling

language model pre-training

external latent memory

memory efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Memory Grafting

conditional memory

offline memory construction