🤖 AI Summary
Large language models (LLMs) are fundamentally stateless, and their limited context windows hinder long-horizon reasoning. Existing external memory approaches rely on static, hand-crafted heuristics and lack dynamic, content-aware memory control. To address this, we propose Memory-R1, an end-to-end learnable memory management framework grounded in reinforcement learning (RL). Our method introduces a dual-agent architecture comprising a Memory Manager, which learns to perform structured memory operations (ADD, UPDATE, DELETE, NOOP), and an Answer Agent, which selects relevant memory entries and reasons over them to produce an answer; both are optimized via policy learning. We employ outcome-oriented RL algorithms, including PPO and GRPO, enabling effective fine-tuning under minimal supervision (only 152 question-answer pairs). Experiments demonstrate substantial improvements over the most competitive existing baselines across diverse complex reasoning tasks, including multi-step deduction, temporal reasoning, and knowledge-intensive QA, while exhibiting strong generalization and robustness across different LLM backbones.
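To make the operation set concrete, the sketch below shows a minimal memory bank applying the four structured operations the Memory Manager chooses among. The `MemoryBank` class, its fields, and the `apply` signature are illustrative assumptions for this summary, not the paper's actual memory schema or prompting interface.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryBank:
    """Hypothetical key-value memory bank; the paper's real format may differ."""
    entries: dict[int, str] = field(default_factory=dict)
    next_id: int = 0

    def apply(self, op: str, entry_id: int | None = None, text: str = "") -> None:
        """Apply one predicted operation from {ADD, UPDATE, DELETE, NOOP}."""
        if op == "ADD":        # insert a new memory entry
            self.entries[self.next_id] = text
            self.next_id += 1
        elif op == "UPDATE":   # rewrite an existing entry with consolidated information
            self.entries[entry_id] = text
        elif op == "DELETE":   # drop an entry that is contradicted or stale
            self.entries.pop(entry_id, None)
        elif op == "NOOP":     # leave the bank unchanged
            pass
        else:
            raise ValueError(f"unknown operation: {op}")

# Example: the manager consolidates new dialogue evidence into memory.
bank = MemoryBank()
bank.apply("ADD", text="Alice adopted a dog named Rex.")
bank.apply("UPDATE", entry_id=0, text="Alice's dog Rex moved with her to Berlin.")
```

The UPDATE, DELETE, and NOOP choices let the manager consolidate or discard information rather than only appending new facts, which is what separates learned memory control from append-only, heuristic pipelines.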
📝 Abstract
Large Language Models (LLMs) have demonstrated impressive capabilities across a wide range of NLP tasks, but they remain fundamentally stateless, constrained by limited context windows that hinder long-horizon reasoning. Recent efforts to address this limitation often augment LLMs with an external memory bank, yet most existing pipelines are static and heuristic-driven, lacking any learned mechanism for deciding what to store, update, or retrieve. We present Memory-R1, a reinforcement learning (RL) framework that equips LLMs with the ability to actively manage and utilize external memory through two specialized agents: a Memory Manager that learns to perform structured memory operations {ADD, UPDATE, DELETE, NOOP}, and an Answer Agent that selects the most relevant entries and reasons over them to produce an answer. Both agents are fine-tuned with outcome-driven RL (PPO and GRPO), enabling adaptive memory management and use with minimal supervision. With as few as 152 question-answer pairs and a corresponding temporal memory bank for training, Memory-R1 outperforms the most competitive existing baseline and demonstrates strong generalization across diverse question types and LLM backbones. Beyond presenting an effective approach, this work provides insights into how RL can unlock more agentic, memory-aware behaviors in LLMs, pointing toward richer, more persistent reasoning systems.
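To illustrate the outcome-driven training signal, here is a minimal sketch assuming a binary exact-match reward on the Answer Agent's final answer, with GRPO's group-relative advantage computed over a group of sampled rollouts. The reward definition and function names are assumptions for illustration; the paper's exact reward and rollout setup are not reproduced here.

```python
import statistics

def exact_match(prediction: str, gold: str) -> float:
    """Binary outcome reward: 1.0 if the final answer matches the gold answer."""
    return float(prediction.strip().lower() == gold.strip().lower())

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Standardize rewards within one sampled group (the core of GRPO)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Example: four sampled answers to one question; only the first is correct.
group = ["Rex", "Buddy", "a cat", "Rex the Great"]
rewards = [exact_match(answer, "Rex") for answer in group]
print(grpo_advantages(rewards))  # the correct rollout gets a positive advantage
```

Because this signal depends only on the final answer, no per-operation labels for the memory edits are needed, which is consistent with the minimal supervision (152 question-answer pairs) described above.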