🤖 AI Summary
Existing large reasoning models (LRMs) rely on sequential knowledge retrieval for multi-step question answering, resulting in high latency, verbose context, and degraded coherence and accuracy. This work proposes HybridDeepSearcher, a framework that supports hybrid parallel-and-sequential retrieval, trained on a synthetically constructed dataset—HDS-QA—derived from Natural Questions and explicitly designed for hybrid-hop reasoning. HDS-QA provides fine-grained reasoning-querying-retrieval paths and enables end-to-end LRM fine-tuning so the model can dynamically choose parallel or sequential execution for its subqueries. The core innovation lies in adaptively combining the two retrieval paradigms. On FanOutQA and a subset of BrowseComp, HybridDeepSearcher achieves +15.9 and +11.5 F1 improvements, respectively, while substantially reducing search turns and inference latency. The approach demonstrates both high accuracy and strong scalability across diverse multi-hop QA scenarios.
📝 Abstract
Large reasoning models (LRMs) have demonstrated strong performance in complex, multi-step reasoning tasks. Existing methods enhance LRMs by sequentially integrating external knowledge retrieval; models iteratively generate queries, retrieve external information, and progressively reason over this information. However, purely sequential querying increases inference latency and context length, diminishing coherence and potentially reducing accuracy. To address these limitations, we introduce HDS-QA (Hybrid Deep Search QA), a synthetic dataset automatically generated from Natural Questions, explicitly designed to train LRMs to distinguish parallelizable from sequential queries. HDS-QA comprises hybrid-hop questions that combine parallelizable independent subqueries (executable simultaneously) and sequentially dependent subqueries (requiring step-by-step resolution), along with synthetic reasoning-querying-retrieval paths involving parallel queries. We fine-tune an LRM using HDS-QA, naming the model HybridDeepSearcher, which outperforms state-of-the-art baselines across multiple benchmarks, notably achieving +15.9 and +11.5 F1 on FanOutQA and a subset of BrowseComp, respectively, both requiring comprehensive and exhaustive search. Experimental results highlight two key advantages: HybridDeepSearcher reaches comparable accuracy with fewer search turns, significantly reducing inference latency, and it effectively scales as more turns are permitted. These results demonstrate the efficiency, scalability, and effectiveness of explicitly training LRMs to leverage hybrid parallel and sequential querying.
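The hybrid querying idea described above can be illustrated with a minimal sketch. This is not the paper's implementation: `mock_retrieve` and the hop structure are hypothetical stand-ins, assuming that subqueries within one search turn are independent (parallelizable) while successive turns depend on earlier results (sequential).

```python
# Minimal sketch of hybrid parallel/sequential retrieval scheduling.
# Assumption: each inner list is a batch of independent subqueries that can be
# issued simultaneously; batches are resolved one after another, since later
# hops may depend on earlier answers.
from concurrent.futures import ThreadPoolExecutor


def mock_retrieve(query: str) -> str:
    # Hypothetical stand-in for a real search/retrieval call.
    return f"evidence for: {query}"


def hybrid_search(hops: list[list[str]]) -> list[list[str]]:
    """Run parallelizable subqueries concurrently within each hop,
    while resolving the hops themselves sequentially."""
    results: list[list[str]] = []
    with ThreadPoolExecutor(max_workers=4) as pool:
        for batch in hops:
            # One search turn: all independent subqueries issued at once,
            # which reduces the number of sequential turns and thus latency.
            results.append(list(pool.map(mock_retrieve, batch)))
    return results


# Example: hop 1 holds two independent subqueries; hop 2 depends on hop 1.
out = hybrid_search([
    ["population of city A", "population of city B"],
    ["compare the two populations"],
])
```

Batching independent subqueries into a single turn is what lets the model reach comparable accuracy with fewer search turns, as the abstract notes.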