🤖 AI Summary
Search agents face a trade-off in multi-turn interactions: feeding the full dialogue history into the LLM context preserves information but produces long, noisy contexts with high computation and memory costs, while using only the current turn avoids that overhead but discards essential context. This paper proposes MemSearcher, an agent workflow that maintains a compact, iteratively updated memory: at each turn, the agent fuses the user's question with the memory to generate reasoning traces, perform search actions, and rewrite the memory to retain only task-essential information, stabilizing context length across turns. To train this workflow end-to-end, the authors introduce multi-context GRPO, an RL algorithm that jointly optimizes reasoning, search strategies, and memory management by sampling groups of trajectories under different contexts and propagating trajectory-level advantages across all conversations within them. Trained on the same dataset as Search-R1, MemSearcher achieves relative average gains of +11% on Qwen2.5-3B-Instruct and +12% on Qwen2.5-7B-Instruct over strong baselines on seven public benchmarks, and the 3B-based variant even outperforms 7B-based baselines, delivering higher accuracy at lower computational cost.
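The per-turn workflow described above can be sketched in a few lines. This is a minimal, illustrative implementation, not the paper's released code: the `SEARCH:`/`ANSWER:` action format, the prompt templates, and the toy stubs are all assumptions made for the sketch.

```python
def run_agent(llm, search, question, max_turns=4):
    """MemSearcher-style loop (illustrative sketch): each turn the LLM sees
    only the question plus a compact memory, never the full dialogue history."""
    memory = ""
    for _ in range(max_turns):
        # Fuse the user's question with the current memory.
        out = llm(f"Question: {question}\nMemory: {memory}")
        if out.startswith("ANSWER:"):
            return out[len("ANSWER:"):].strip(), memory
        if out.startswith("SEARCH:"):
            docs = search(out[len("SEARCH:"):].strip())
            # Memory update: rewrite the memory to keep only task-essential
            # facts, so context length stays roughly constant across turns.
            memory = llm(
                f"Question: {question}\nMemory: {memory}\n"
                f"Results: {docs}\nUpdate memory:"
            )
    return None, memory


# Toy stand-ins for the LLM and search engine, just to exercise the loop.
def toy_llm(prompt):
    if "Update memory:" in prompt:
        return "Paris is the capital of France."
    if "Paris is the capital" in prompt:
        return "ANSWER: Paris"
    return "SEARCH: capital of France"

def toy_search(query):
    return "Wikipedia: Paris is the capital of France."

answer, memory = run_agent(toy_llm, toy_search, "What is France's capital?")
# answer == "Paris"; memory holds the one retained fact
```

The key property the sketch preserves is that search results reach later turns only through the rewritten memory, not through an ever-growing transcript.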
📝 Abstract
Typical search agents concatenate the entire interaction history into the LLM context, preserving information integrity but producing long, noisy contexts, resulting in high computation and memory costs. In contrast, using only the current turn avoids this overhead but discards essential information. This trade-off limits the scalability of search agents. To address this challenge, we propose MemSearcher, an agent workflow that iteratively maintains a compact memory and combines the current turn with it. At each turn, MemSearcher fuses the user's question with the memory to generate reasoning traces, perform search actions, and update the memory to retain only information essential for solving the task. This design stabilizes context length across multi-turn interactions, improving efficiency without sacrificing accuracy. To optimize this workflow, we introduce multi-context GRPO, an end-to-end RL framework that jointly optimizes the reasoning, search strategies, and memory management of MemSearcher agents. Specifically, multi-context GRPO samples groups of trajectories under different contexts and propagates trajectory-level advantages across all conversations within them. Trained on the same dataset as Search-R1, MemSearcher achieves significant improvements over strong baselines on seven public benchmarks: relative average gains of +11% on Qwen2.5-3B-Instruct and +12% on Qwen2.5-7B-Instruct. Notably, the 3B-based MemSearcher even outperforms 7B-based baselines, demonstrating that striking a balance between information integrity and efficiency yields both higher accuracy and lower computational overhead. The code and models will be publicly available at https://github.com/icip-cas/MemSearcher.
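The advantage propagation in multi-context GRPO can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the function name, the dict fields, and the use of the population standard deviation are assumptions. As in GRPO, each trajectory's advantage is its group-normalized reward; the sketch then broadcasts that trajectory-level advantage to every conversation (one per context, i.e. per question-plus-memory turn) inside the trajectory.

```python
import statistics

def multi_context_grpo_advantages(trajectories):
    """Assign group-normalized, trajectory-level advantages to every
    conversation in each trajectory (illustrative sketch).

    `trajectories` is a list of dicts, each with a scalar "reward" and a
    list "conversations" of per-turn dicts generated under different
    contexts. Field names are hypothetical.
    """
    rewards = [t["reward"] for t in trajectories]
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0   # guard against zero variance
    for t in trajectories:
        adv = (t["reward"] - mean) / std
        # Propagate the same trajectory-level advantage to all
        # conversations sampled within this trajectory.
        for conv in t["conversations"]:
            conv["advantage"] = adv
    return trajectories

group = [
    {"reward": 1.0, "conversations": [{"turn": 1}, {"turn": 2}]},
    {"reward": 0.0, "conversations": [{"turn": 1}]},
]
multi_context_grpo_advantages(group)
# every conversation of the first trajectory gets advantage +1.0,
# the second trajectory's conversation gets -1.0
```

Broadcasting one advantage per trajectory is what lets a single outcome reward supervise every context the agent generated along the way, including the memory-update steps.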