🤖 AI Summary
Search agents face a trade-off in multi-turn interactions: feeding the full dialogue history into the LLM context preserves information but produces long, noisy contexts with high computation and memory costs, while using only the current turn avoids that overhead but discards essential context. This paper proposes MemSearcher, an agent workflow that maintains a compact, iteratively updated memory: at each turn, the agent fuses the user's question with the memory to generate reasoning traces, perform search actions, and rewrite the memory to retain only task-essential information, stabilizing context length across turns. To train this workflow end-to-end, the authors introduce multi-context GRPO, an RL algorithm that jointly optimizes reasoning, search strategies, and memory management by sampling groups of trajectories under different contexts and propagating trajectory-level advantages across all conversations within them. Trained on the same dataset as Search-R1, MemSearcher achieves relative average gains of +11% on Qwen2.5-3B-Instruct and +12% on Qwen2.5-7B-Instruct over strong baselines on seven public benchmarks, and the 3B-based variant even outperforms 7B-based baselines, delivering higher accuracy at lower computational cost.
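The per-turn workflow described above can be sketched in a few lines. This is a minimal, illustrative implementation, not the paper's released code: the `SEARCH:`/`ANSWER:` action format, the prompt templates, and the toy stubs are all assumptions made for the sketch.

```python
def run_agent(llm, search, question, max_turns=4):
    """MemSearcher-style loop (illustrative sketch): each turn the LLM sees
    only the question plus a compact memory, never the full dialogue history."""
    memory = ""
    for _ in range(max_turns):
        # Fuse the user's question with the current memory.
        out = llm(f"Question: {question}\nMemory: {memory}")
        if out.startswith("ANSWER:"):
            return out[len("ANSWER:"):].strip(), memory
        if out.startswith("SEARCH:"):
            docs = search(out[len("SEARCH:"):].strip())
            # Memory update: rewrite the memory to keep only task-essential
            # facts, so context length stays roughly constant across turns.
            memory = llm(
                f"Question: {question}\nMemory: {memory}\n"
                f"Results: {docs}\nUpdate memory:"
            )
    return None, memory


# Toy stand-ins for the LLM and search engine, just to exercise the loop.
def toy_llm(prompt):
    if "Update memory:" in prompt:
        return "Paris is the capital of France."
    if "Paris is the capital" in prompt:
        return "ANSWER: Paris"
    return "SEARCH: capital of France"

def toy_search(query):
    return "Wikipedia: Paris is the capital of France."

answer, memory = run_agent(toy_llm, toy_search, "What is France's capital?")
# answer == "Paris"; memory holds the one retained fact
```

The key property the sketch preserves is that search results reach later turns only through the rewritten memory, not through an ever-growing transcript.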
📝 Abstract
Typical search agents concatenate the entire interaction history into the LLM context, preserving information integrity but producing long, noisy contexts, resulting in high computation and memory costs. In contrast, using only the current turn avoids this overhead but discards essential information. This trade-off limits the scalability of search agents. To address this challenge, we propose MemSearcher, an agent workflow that iteratively maintains a compact memory and combines the current turn with it. At each turn, MemSearcher fuses the user's question with the memory to generate reasoning traces, perform search actions, and update the memory to retain only information essential for solving the task. This design stabilizes context length across multi-turn interactions, improving efficiency without sacrificing accuracy. To optimize this workflow, we introduce multi-context GRPO, an end-to-end RL framework that jointly optimizes the reasoning, search strategies, and memory management of MemSearcher agents. Specifically, multi-context GRPO samples groups of trajectories under different contexts and propagates trajectory-level advantages across all conversations within them. Trained on the same dataset as Search-R1, MemSearcher achieves significant improvements over strong baselines on seven public benchmarks: relative average gains of +11% on Qwen2.5-3B-Instruct and +12% on Qwen2.5-7B-Instruct. Notably, the 3B-based MemSearcher even outperforms 7B-based baselines, demonstrating that striking a balance between information integrity and efficiency yields both higher accuracy and lower computational overhead. The code and models will be publicly available at https://github.com/icip-cas/MemSearcher.
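The advantage propagation in multi-context GRPO can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the function name, the dict fields, and the use of the population standard deviation are assumptions. As in GRPO, each trajectory's advantage is its group-normalized reward; the sketch then broadcasts that trajectory-level advantage to every conversation (one per context, i.e. per question-plus-memory turn) inside the trajectory.

```python
import statistics

def multi_context_grpo_advantages(trajectories):
    """Assign group-normalized, trajectory-level advantages to every
    conversation in each trajectory (illustrative sketch).

    `trajectories` is a list of dicts, each with a scalar "reward" and a
    list "conversations" of per-turn dicts generated under different
    contexts. Field names are hypothetical.
    """
    rewards = [t["reward"] for t in trajectories]
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0   # guard against zero variance
    for t in trajectories:
        adv = (t["reward"] - mean) / std
        # Propagate the same trajectory-level advantage to all
        # conversations sampled within this trajectory.
        for conv in t["conversations"]:
            conv["advantage"] = adv
    return trajectories

group = [
    {"reward": 1.0, "conversations": [{"turn": 1}, {"turn": 2}]},
    {"reward": 0.0, "conversations": [{"turn": 1}]},
]
multi_context_grpo_advantages(group)
# every conversation of the first trajectory gets advantage +1.0,
# the second trajectory's conversation gets -1.0
```

Broadcasting one advantage per trajectory is what lets a single outcome reward supervise every context the agent generated along the way, including the memory-update steps.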