🤖 AI Summary
This work addresses the challenge of inefficient exploration in sparse-reward reinforcement learning, where conventional methods struggle to adapt their search strategies at test time because exploration is confined to a single episode. To overcome this, the authors propose MR-Search, a novel approach that leverages contextual meta-reinforcement learning and a self-reflection mechanism, enabling agents to dynamically refine their exploration policies at test time by aggregating cross-episode experience. The method incorporates self-reflection as a contextual signal within a multi-episode reinforcement learning framework and employs fine-grained relative advantage estimation to facilitate effective credit assignment and online policy improvement. Empirical results demonstrate that MR-Search achieves performance gains of 9.2%–19.3% over baseline methods across eight benchmark tasks, significantly enhancing both generalization capability and search efficiency.
📝 Abstract
This paper introduces MR-Search, an in-context meta reinforcement learning (RL) formulation for agentic search with self-reflection. Instead of optimizing a policy within a single, independent episode with sparse rewards, MR-Search trains a policy that conditions on past episodes and adapts its search strategy across episodes. MR-Search learns to learn a search strategy with self-reflection, allowing search agents to improve in-context exploration at test time. Specifically, MR-Search performs cross-episode exploration by generating explicit self-reflections after each episode and leveraging them as additional context to guide subsequent attempts, thereby promoting more effective exploration at test time. We further introduce a multi-turn RL algorithm that estimates a dense relative advantage at the turn level, enabling fine-grained credit assignment within each episode. Empirical results demonstrate the advantages of MR-Search over RL-based baselines, showing strong generalization and relative improvements of 9.2% to 19.3% across eight benchmarks. Our code and data are available at https://github.com/tengxiao1/MR-Search.
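To make the turn-level relative advantage concrete, here is a minimal, hypothetical sketch (not the paper's actual implementation): given a group of sampled episodes for the same query, each episode's per-turn reward is normalized against the group's statistics at that turn, so every turn receives its own dense advantage rather than a single sparse episode-level signal. The function name, the group-normalization scheme, and the fixed-turn-count assumption are all illustrative assumptions.

```python
# Illustrative sketch of turn-level relative advantage estimation.
# Assumption: K episodes sampled for the same query, each with the
# same number of turns; rewards are normalized per turn across the
# group (a GRPO-style normalization, chosen here for illustration).

from statistics import mean, pstdev

def turn_level_relative_advantages(episode_rewards):
    """episode_rewards: list of K lists of per-turn rewards.

    Returns a K x T list of advantages, where each turn's advantage
    is the reward standardized against that turn's group statistics.
    """
    num_turns = len(episode_rewards[0])
    advantages = []
    for ep in episode_rewards:
        adv = []
        for t in range(num_turns):
            group = [other[t] for other in episode_rewards]  # turn t across episodes
            mu, sigma = mean(group), pstdev(group)
            adv.append((ep[t] - mu) / (sigma + 1e-8))  # dense, per-turn signal
        advantages.append(adv)
    return advantages
```

For example, with two episodes whose first-turn rewards are 1.0 and 0.0, the first episode's opening turn receives a positive advantage and the second a negative one, even before any episode-level outcome is known.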