Pangu DeepDiver: Adaptive Search Intensity Scaling via Open-Web Reinforcement Learning

📅 2025-05-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) struggle with dynamic Search Intensity Scaling (SIS)—i.e., adaptively determining when, how deeply, and how frequently to search the open web based on query ambiguity and evidential conflict—leading to overconfidence and insufficient verification. This work formally defines SIS for open-web information seeking and introduces WebPuzzle, the first benchmark dataset designed specifically for this task. We propose a reinforcement learning (RL) framework grounded in realistic web interaction, combining cold-start supervised fine-tuning with phased RL training. Integrating SIS capabilities into Pangu-7B-Reasoner yields Pangu-7B-Reasoner+DeepDiver. Experiments demonstrate that our model matches the performance of the 671B-parameter DeepSeek-R1 on real-world web tasks, while significantly improving evidence verification in multi-hop reasoning and long-text generation. Crucially, it generalizes SIS beyond closed QA to complex generative tasks.

📝 Abstract
Information seeking demands iterative evidence gathering and reflective reasoning, yet large language models (LLMs) still struggle with it in open-web question answering. Existing methods rely on static prompting rules or on training with Wikipedia-based corpora and retrieval environments, which limits their adaptability to the real-world web, where ambiguity, conflicting evidence, and noise are prevalent. These constrained training settings keep LLMs from learning to decide dynamically when and where to search, and how to adjust search depth and frequency to the informational demands of a query. We define this missing capacity as Search Intensity Scaling (SIS): the emergent skill of intensifying search effort under ambiguous or conflicting conditions rather than settling on overconfident, under-verified answers. To study SIS, we introduce WebPuzzle, the first dataset designed to foster information-seeking behavior in open-world internet environments. WebPuzzle consists of 24K training instances and 275 test questions spanning both wiki-based and open-web queries. Building on this dataset, we propose DeepDiver, a Reinforcement Learning (RL) framework that promotes SIS by encouraging adaptive search policies through exploration in a real-world open-web environment. Experimental results show that Pangu-7B-Reasoner empowered by DeepDiver achieves performance on real-web tasks comparable to that of the 671B-parameter DeepSeek-R1. We detail DeepDiver's training curriculum, from cold-start supervised fine-tuning to a carefully designed RL phase, and show that its SIS capability generalizes from closed-form QA to open-ended tasks such as long-form writing. Our contributions advance adaptive information seeking in LLMs and provide a valuable benchmark and dataset for future research.
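
To make the idea of Search Intensity Scaling concrete, here is a minimal sketch of the behavior the abstract describes: search more when evidence is thin or conflicting, and stop early when it agrees. This is not DeepDiver's method. In the paper the policy is learned through RL, whereas the loop below is a hand-written heuristic. Every name in it (`agreement`, `answer_with_scaled_search`, the `search` callable, the `threshold` value, and the two toy search functions) is a hypothetical stand-in introduced for illustration.

```python
from collections import Counter
from typing import Callable, List, Tuple


def agreement(answers: List[str]) -> float:
    """Fraction of gathered answers that match the most common one."""
    if not answers:
        return 0.0
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers)


def answer_with_scaled_search(
    question: str,
    search: Callable[[str, int], str],  # hypothetical: returns one candidate answer per round
    min_rounds: int = 2,
    max_rounds: int = 8,
    threshold: float = 0.75,  # assumed stopping criterion, not from the paper
) -> Tuple[str, int]:
    """Keep searching while the evidence is thin or conflicting.

    Returns the majority answer and the number of search rounds used.
    """
    answers: List[str] = []
    for round_idx in range(max_rounds):
        answers.append(search(question, round_idx))
        # Stop once enough rounds have run and the evidence mostly agrees.
        if len(answers) >= min_rounds and agreement(answers) >= threshold:
            break
    best = Counter(answers).most_common(1)[0][0]
    return best, len(answers)


# Toy search backends standing in for a real web search tool.
def consistent_search(question: str, round_idx: int) -> str:
    return "Paris"


def conflicting_search(question: str, round_idx: int) -> str:
    # The second result disagrees with the others, forcing extra rounds.
    return "B" if round_idx == 1 else "A"


easy = answer_with_scaled_search("capital of France?", consistent_search)
hard = answer_with_scaled_search("ambiguous question", conflicting_search)
```

With the toy backends, the consistent query stops after the minimum of two rounds, while the conflicting one needs four rounds before the agreement threshold is met. That gap in search effort is what SIS refers to, here produced by a fixed rule rather than a trained policy.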
Problem

Research questions and friction points this paper is trying to address.

LLMs struggle with adaptive open-web question answering
Existing methods lack adaptability to real-world web noise
Models fail to dynamically adjust search depth and frequency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement Learning for adaptive search
Open-web environment training dataset
Dynamic search intensity scaling