WebSeer: Training Deeper Search Agents through Reinforcement Learning with Self-Reflection

Date: 2025-10-21
AI Summary
Existing web search agents suffer from insufficient tool-call depth and error accumulation in multi-step interactions within complex web environments. This paper proposes a self-reflection-enhanced reinforcement learning framework, introducing a large-scale trajectory dataset annotated with reflection-mode labels to enable a two-stage training paradigm: cold-start pretraining followed by RL-based fine-tuning, fully end-to-end optimized within a single 14B-parameter model. The method supports long-horizon tool orchestration and dynamic decision-making, significantly improving retrieval robustness. It achieves state-of-the-art accuracy of 72.3% on HotpotQA and 90.0% on SimpleQA, while demonstrating strong out-of-distribution generalization. The core contribution lies in deeply integrating a structured reflection mechanism into the RL pipeline, effectively mitigating error propagation through iterative self-assessment and corrective action selection.

πŸ“ Abstract
Search agents have achieved significant advancements in enabling intelligent information retrieval and decision-making within interactive environments. Although reinforcement learning has been employed to train agentic models capable of more dynamic interactive retrieval, existing methods are limited by shallow tool-use depth and the accumulation of errors over multiple iterative interactions. In this paper, we present WebSeer, a more intelligent search agent trained via reinforcement learning enhanced with a self-reflection mechanism. Specifically, we construct a large dataset annotated with reflection patterns and design a two-stage training framework that unifies cold start and reinforcement learning within the self-reflection paradigm for real-world web-based environments, which enables the model to generate longer and more reflective tool-use trajectories. Our approach substantially extends tool-use chains and improves answer accuracy. Using a single 14B model, we achieve state-of-the-art results on HotpotQA and SimpleQA, with accuracies of 72.3% and 90.0%, respectively, and demonstrate strong generalization to out-of-distribution datasets. The code is available at https://github.com/99hgz/WebSeer
Problem

Research questions and friction points this paper is trying to address.

Training deeper search agents with a self-reflection mechanism
Overcoming shallow tool-use depth in interactive retrieval
Reducing error accumulation in multi-step search interactions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement learning with a self-reflection mechanism
Two-stage training framework for tool-use trajectories
Single 14B model achieving state-of-the-art accuracy
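
The reflection idea in the list above can be pictured as a loop in which the agent searches, proposes an answer, checks that answer against the evidence gathered so far, and searches again when the check fails. The following is a minimal Python sketch under stated assumptions, not the WebSeer implementation: the names `run_agent`, `toy_search`, `toy_propose`, and `toy_verify` are hypothetical, the query plan is fixed rather than generated by a model, and the corpus is made-up toy data.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    kind: str      # "search", "answer", or "reflect"
    content: str


@dataclass
class Trajectory:
    question: str
    steps: List[Step] = field(default_factory=list)

    def tool_calls(self) -> int:
        """Number of search-tool calls recorded in the trajectory."""
        return sum(1 for s in self.steps if s.kind == "search")


def run_agent(
    question: str,
    queries: List[str],
    search: Callable[[str], str],
    propose: Callable[[str, List[str]], Optional[str]],
    verify: Callable[[str, List[str]], bool],
    max_calls: int = 8,
) -> Tuple[Optional[str], Trajectory]:
    """Search, propose an answer, reflect on it, and continue if rejected."""
    traj = Trajectory(question)
    evidence: List[str] = []
    for query in queries[:max_calls]:
        evidence.append(search(query))
        traj.steps.append(Step("search", query))
        candidate = propose(question, evidence)
        if candidate is None:
            continue
        traj.steps.append(Step("answer", candidate))
        # Reflection step: accept the candidate only if the evidence supports it.
        if verify(candidate, evidence):
            traj.steps.append(Step("reflect", "supported by evidence"))
            return candidate, traj
        traj.steps.append(Step("reflect", "unsupported, search again"))
    return None, traj


# Made-up toy corpus standing in for a real web search tool (assumption).
TOY_CORPUS = {
    "director of Film X": "Film X was directed by Alice Doe.",
    "Alice Doe birthplace": "Alice Doe was born in Lyon.",
}


def toy_search(query: str) -> str:
    return TOY_CORPUS.get(query, "no results")


def toy_propose(question: str, evidence: List[str]) -> Optional[str]:
    # Stand-in for the policy model: naively guess the last word of the newest snippet.
    return evidence[-1].rstrip(".").split()[-1]


def toy_verify(candidate: str, evidence: List[str]) -> bool:
    # Stand-in for self-reflection: require an explicit supporting statement.
    return any(f"born in {candidate}" in snippet for snippet in evidence)


answer, traj = run_agent(
    "Where was the director of Film X born?",
    list(TOY_CORPUS),  # fixed query plan, in insertion order
    toy_search,
    toy_propose,
    toy_verify,
)
print(answer, traj.tool_calls())
```

In this toy run the first guess is rejected by the reflection check, which forces a second search before an answer is accepted. That is the qualitative behavior the summary describes (longer tool-use chains with self-correction), though the trained model would produce queries, answers, and reflections itself rather than rely on hand-written helpers.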
Guanzhong He
Tsinghua University
Zhen Yang
Tsinghua University
Jinxin Liu
Tsinghua University
Bin Xu
Tsinghua University
Lei Hou
RMIT University
Building Information Modeling (BIM) - Project Management - Construction IT - Productivity Research - Lean Construction
Juanzi Li
Tsinghua University
Semantic Web - data mining - NLP