🤖 AI Summary
Standard Transformers exhibit inherent limitations in multi-step reasoning, relational argumentation, and long-context integration. To address these challenges, this paper proposes LM2, a Large Memory Model that adds a lightweight, non-intrusive auxiliary memory module to a decoder-only architecture. The module models memory explicitly and supports adaptive memory updates at test time, without altering the base model's structure or disrupting standard pretraining pipelines. It acts as a contextual representation repository that input tokens read through cross-attention and that is updated through gating mechanisms, a design intended to keep the memory interpretable and broadly applicable. On BABILong, LM2 improves average accuracy by 37.1% over RMT and by 86.3% over Llama-3.2; on MMLU, it improves by 5.0% over a pre-trained vanilla model. The summary also reports state-of-the-art performance on multi-hop reasoning, numerical reasoning, and question answering over hundred-thousand-token contexts.
📝 Abstract
This paper introduces the Large Memory Model (LM2), a decoder-only Transformer architecture enhanced with an auxiliary memory module that aims to address the limitations of standard Transformers in multi-step reasoning, relational argumentation, and synthesizing information distributed over long contexts. The proposed LM2 incorporates a memory module that acts as a contextual representation repository, interacting with input tokens via cross-attention and updating through gating mechanisms. To preserve the Transformer's general-purpose capabilities, LM2 maintains the original information flow while integrating a complementary memory pathway. Experimental results on the BABILong benchmark demonstrate that the LM2 model outperforms the memory-augmented RMT model by 37.1% and the baseline Llama-3.2 model by 86.3% on average across tasks. LM2 exhibits exceptional capabilities in multi-hop inference, numerical reasoning, and large-context question answering. On the MMLU dataset, it achieves a 5.0% improvement over a pre-trained vanilla model, demonstrating that its memory module does not degrade performance on general tasks. Further, in our analysis, we explore memory interpretability, the effectiveness of the memory module, and test-time behavior. Our findings emphasize the importance of explicit memory in enhancing Transformer architectures.
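To make the described memory pathway concrete, the following is a minimal NumPy sketch of one read-and-update step, written under assumptions rather than taken from the paper. It shows tokens reading from a memory bank via cross-attention, an output gate that adds the read-out to the unchanged token stream (the "complementary pathway"), and a gated memory update. The function and parameter names (`memory_step`, `make_params`, `Wq`, `Wk`, `Wv`, `Wo`, `Uq`, `Uk`, `Wi`, `Wf`), the single-head attention, the write-side attention, and the exact gate formulas are all illustrative choices; the abstract only states that cross-attention and gating are used.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def make_params(d, rng):
    # Hypothetical projection matrices; names are assumptions of this sketch.
    names = ["Wq", "Wk", "Wv", "Wo", "Uq", "Uk", "Wi", "Wf"]
    return {n: rng.normal(scale=d ** -0.5, size=(d, d)) for n in names}

def memory_step(tokens, memory, params):
    """One illustrative memory read + gated update.

    tokens: (T, d) token representations
    memory: (N, d) memory slots
    Returns (tokens_out, memory_out) with the same shapes.
    """
    d = tokens.shape[-1]

    # Read: tokens are queries, memory slots provide keys and values.
    q = tokens @ params["Wq"]
    k = memory @ params["Wk"]
    v = memory @ params["Wv"]
    read = softmax(q @ k.T / np.sqrt(d)) @ v            # (T, d)

    # Output gate scales the read-out, which is ADDED to the original
    # token stream so the base information flow is left intact.
    out_gate = sigmoid(read @ params["Wo"])             # (T, d)
    tokens_out = tokens + out_gate * read

    # Write (assumed form): each slot attends over tokens to get a candidate.
    scores = (memory @ params["Uq"]) @ (tokens @ params["Uk"]).T / np.sqrt(d)
    candidate = softmax(scores) @ tokens                # (N, d)

    # Input gate admits new content; forget gate retains old content.
    in_gate = sigmoid(candidate @ params["Wi"])
    forget_gate = sigmoid(memory @ params["Wf"])
    memory_out = forget_gate * memory + in_gate * np.tanh(candidate)
    return tokens_out, memory_out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, T, N = 16, 5, 4
    params = make_params(d, rng)
    tokens = rng.normal(size=(T, d))
    memory = np.zeros((N, d))
    tokens_out, memory_out = memory_step(tokens, memory, params)
    print(tokens_out.shape, memory_out.shape)
```

In this sketch, an all-zero memory produces a zero read-out, so the token stream passes through unchanged. That is one simple way an additive memory pathway can leave the base model's behavior intact until the memory holds content.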