MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning

📅 2025-11-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing search agents face a trade-off in multi-turn interactions: feeding the full dialogue history into the LLM context preserves information but causes context explosion and high computational overhead, while using only the current turn avoids the overhead but discards critical contextual information. This paper proposes MemSearcher, an end-to-end reinforcement learning framework that jointly models reasoning, search, and memory management. Its core is a compact memory that is iteratively updated and fused with the user's question at each turn, keeping context length stable across the interaction. The authors further design multi-context GRPO, an RL algorithm that samples groups of trajectories under different contexts and propagates trajectory-level advantages across all conversations within them, jointly optimizing reasoning, search strategies, and memory updates. Evaluated with Qwen2.5-series models on seven public benchmarks, MemSearcher achieves significant gains, averaging +11% relative accuracy over strong baselines, and the 3B-Instruct variant even outperforms 7B-based baselines, delivering higher accuracy at lower computational cost.

📝 Abstract
Typical search agents concatenate the entire interaction history into the LLM context, preserving information integrity but producing long, noisy contexts that incur high computation and memory costs. In contrast, using only the current turn avoids this overhead but discards essential information. This trade-off limits the scalability of search agents. To address this challenge, we propose MemSearcher, an agent workflow that iteratively maintains a compact memory and combines the current turn with it. At each turn, MemSearcher fuses the user's question with the memory to generate reasoning traces, perform search actions, and update the memory to retain only information essential for solving the task. This design stabilizes context length across multi-turn interactions, improving efficiency without sacrificing accuracy. To optimize this workflow, we introduce multi-context GRPO, an end-to-end RL framework that jointly optimizes the reasoning, search strategies, and memory management of MemSearcher agents. Specifically, multi-context GRPO samples groups of trajectories under different contexts and propagates trajectory-level advantages across all conversations within them. Trained on the same dataset as Search-R1, MemSearcher achieves significant improvements over strong baselines on seven public benchmarks: relative average gains of +11% with Qwen2.5-3B-Instruct and +12% with Qwen2.5-7B-Instruct. Notably, the 3B-based MemSearcher even outperforms 7B-based baselines, demonstrating that striking a balance between information integrity and efficiency yields both higher accuracy and lower computational overhead. The code and models will be publicly available at https://github.com/icip-cas/MemSearcher.
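The per-turn workflow the abstract describes can be sketched as a simple loop. This is a hypothetical illustration, not the authors' implementation: `llm` and `search` are placeholder interfaces, and the key point is that only the question plus a compact memory reach the model each turn, so context length stays roughly constant instead of growing with the dialogue history.

```python
# Hypothetical sketch of a MemSearcher-style turn loop.
# `llm` and `search` are assumed placeholder interfaces, not the paper's API.

def memsearcher_episode(question, llm, search, max_turns=4):
    """Iteratively reason, search, and rewrite a compact memory."""
    memory = ""  # compact task notes, NOT the full interaction history
    for _ in range(max_turns):
        # The model reads question + memory and either answers or searches.
        step = llm.act(question=question, memory=memory)
        if step["kind"] == "answer":
            return step["text"]
        evidence = search(step["text"])  # step["text"] is the search query
        # Compress the new evidence into the next memory, keeping only
        # information essential for solving the task.
        memory = llm.update_memory(question, memory, evidence)
    return None  # turn budget exhausted without an answer
```

Note how the memory update replaces history concatenation: the context handed to the model at turn t is O(|question| + |memory|) rather than O(t).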
Problem

Research questions and friction points this paper is trying to address.

Optimizes memory management in search agents to reduce computational costs
Addresses trade-off between information integrity and context length scalability
Jointly trains reasoning, search strategies and memory via reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Iteratively maintains compact memory for efficient reasoning
Uses multi-context GRPO RL framework for joint optimization
Combines current turn with memory to stabilize context length
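The advantage propagation behind multi-context GRPO can be illustrated with a small sketch. This assumes the standard GRPO group-normalized advantage and, per the abstract, broadcasts each trajectory-level advantage to all per-turn conversations within that trajectory; the function name and interface are illustrative, not from the paper.

```python
# Illustrative sketch of trajectory-level advantage propagation in a
# multi-context GRPO style. Assumes standard GRPO group normalization.

def multi_context_grpo_advantages(rewards, turns_per_traj, eps=1e-8):
    """rewards[i]: scalar outcome reward of trajectory i in the group.
    turns_per_traj[i]: number of per-turn conversations (contexts) in
    trajectory i. Returns one advantage per conversation, flattened."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    # Group-relative advantage, one scalar per trajectory.
    advs = [(r - mean) / (std + eps) for r in rewards]
    # Propagate each trajectory-level advantage to every conversation in it,
    # so memory-update turns and answer turns share the same credit signal.
    return [a for a, t in zip(advs, turns_per_traj) for _ in range(t)]
```

With rewards `[1.0, 0.0]` and `[2, 3]` turns, the first trajectory's two conversations all receive the positive normalized advantage and the second trajectory's three conversations the negative one.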
Qianhao Yuan
Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Jie Lou
Xiaohongshu
Alignment, RLHF
Zichao Li
Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Jiawei Chen
Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Yaojie Lu
Institute of Software, Chinese Academy of Sciences
Information Extraction, Large Language Models
Hongyu Lin
Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences
Le Sun
Institute of Software, CAS
Information Retrieval, Natural Language Processing
Debing Zhang
Xiaohongshu
Machine Learning, Computer Vision, Deep Learning
Xianpei Han
Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences