🤖 AI Summary
Current large language models (LLMs) struggle with deep research tasks in real-world web environments because they rely on hand-crafted prompts or constrained retrieval-augmented generation (RAG) setups, which fail to handle the web's openness, dynamism, and noise. To address this, we propose the first end-to-end reinforcement learning (RL) framework that enables LLM agents to interact directly with web pages via browser APIs, performing autonomous information retrieval, multi-source cross-verification, emergent planning, self-reflection, and honest refusal to answer. Our method integrates a multi-agent architecture, adaptive webpage structure extraction, and RL training guided by reward modeling. Experiments on open-domain research tasks show that our approach outperforms prompt-engineering baselines by up to 28.9 points and RAG-based RL baselines by up to 7.2 points, significantly improving factual consistency and research robustness under realistic web conditions.
📝 Abstract
Large Language Models (LLMs) equipped with web search capabilities have demonstrated impressive potential for deep research tasks. However, current approaches predominantly rely on either manually engineered prompts (prompt engineering-based) with brittle performance or reinforcement learning within controlled Retrieval-Augmented Generation (RAG) environments (RAG-based) that fail to capture the complexities of real-world interaction. In this paper, we introduce DeepResearcher, the first comprehensive framework for end-to-end training of LLM-based deep research agents through scaling reinforcement learning (RL) in real-world environments with authentic web search interactions. Unlike RAG-based approaches that assume all necessary information exists within a fixed corpus, our method trains agents to navigate the noisy, unstructured, and dynamic nature of the open web. We implement a specialized multi-agent architecture in which browsing agents extract relevant information from various webpage structures, overcoming significant technical challenges. Extensive experiments on open-domain research tasks demonstrate that DeepResearcher achieves substantial improvements of up to 28.9 points over prompt engineering-based baselines and up to 7.2 points over RAG-based RL agents. Our qualitative analysis reveals emergent cognitive behaviors from end-to-end RL training, including the ability to formulate plans, cross-validate information from multiple sources, engage in self-reflection to redirect research, and maintain honesty when unable to find definitive answers. Our results highlight that end-to-end training in real-world web environments is not merely an implementation detail but a fundamental requirement for developing robust research capabilities aligned with real-world applications. We release DeepResearcher at https://github.com/GAIR-NLP/DeepResearcher.
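To make the described behaviors concrete, here is a minimal, self-contained sketch of the two that are easiest to illustrate without a live browser: cross-validating an answer across multiple sources and refusing honestly when no definitive answer emerges. Everything here is hypothetical (the `FAKE_WEB` stub, the `search` and `research` helpers, and the agreement threshold are illustrative inventions, not DeepResearcher's actual API or trained policy); the real system learns these behaviors through RL over authentic web search interactions rather than hard-coding them.

```python
from collections import Counter

# Hypothetical toy "web": query -> list of (source, snippet) pairs.
# The noisy third source for the first query mimics the unreliable,
# contradictory pages an agent encounters on the open web.
FAKE_WEB = {
    "capital of australia": [
        ("site-a", "Canberra"),
        ("site-b", "Canberra"),
        ("site-c", "Sydney"),  # noisy, incorrect source
    ],
    "obscure question": [
        ("site-a", "unclear"),
    ],
}

def search(query):
    """Stand-in for a real browser/search API call."""
    return FAKE_WEB.get(query, [])

def research(query, min_agreement=2):
    """Answer only when at least `min_agreement` independent sources
    agree; otherwise refuse honestly instead of guessing."""
    snippets = [snippet for _, snippet in search(query)]
    if not snippets:
        return "I could not find an answer."
    answer, count = Counter(snippets).most_common(1)[0]
    if count >= min_agreement:
        return answer
    return "I could not find a definitive answer."
```

In this sketch the refusal is a fixed rule; in the paper's framework, by contrast, honesty and cross-verification are reported as emergent behaviors shaped by end-to-end RL reward signals, not hand-written heuristics.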