OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training Data

📅 2026-03-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the scarcity and opacity of high-quality training data that have hindered broad academic participation in frontier web-search agent research. To overcome this limitation, the authors propose a fact-anchored, scalable, and controllable question-answer synthesis framework. It combines multi-hop question generation, which leverages web-graph topology expansion and entity confusion, with a retrospective-summarization-based denoising mechanism for trajectory synthesis, enabling efficient construction of high-fidelity training data. Remarkably, supervised fine-tuning on only 11.7k synthetic samples lets the model match or surpass both existing open-source and industrial-grade systems on benchmarks such as BrowseComp and BrowseComp-ZH: it significantly outperforms DeepDive and even exceeds Tongyi DeepResearch on BrowseComp-ZH, advancing the democratization of open web-search agent research.

📝 Abstract
Deep search capabilities have become an indispensable competency for frontier Large Language Model (LLM) agents, yet the development of high-performance search agents remains dominated by industrial giants due to a lack of transparent, high-quality training data. This persistent data scarcity has fundamentally hindered the progress of the broader research community in developing and innovating within this domain. To bridge this gap, we introduce OpenSeeker, the first fully open-source search agent (i.e., model and data) that achieves frontier-level performance through two core technical innovations: (1) Fact-grounded, scalable, and controllable QA synthesis, which reverse-engineers the web graph via topological expansion and entity obfuscation to generate complex, multi-hop reasoning tasks with controllable coverage and complexity. (2) Denoised trajectory synthesis, which employs a retrospective summarization mechanism to denoise trajectories, thereby enabling the teacher LLMs to generate high-quality actions. Experimental results demonstrate that OpenSeeker, trained in a single run on only 11.7k synthesized samples, achieves state-of-the-art performance across multiple benchmarks including BrowseComp, BrowseComp-ZH, xbench-DeepSearch, and WideSearch. Notably, trained with simple SFT, OpenSeeker significantly outperforms the second-best fully open-source agent, DeepDive (e.g., 29.5% vs. 15.3% on BrowseComp), and even surpasses industrial competitors such as Tongyi DeepResearch (trained via extensive continual pre-training, SFT, and RL) on BrowseComp-ZH (48.4% vs. 46.7%). We fully open-source the complete training dataset and the model weights to democratize frontier search agent research and foster a more transparent, collaborative ecosystem.
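The first innovation — generating multi-hop questions by expanding outward over a web graph and then obfuscating entity mentions — can be illustrated with a minimal sketch. This is not the paper's implementation; the toy graph, the clue dictionary, and all function names (`expand_chain`, `synthesize_qa`) are illustrative assumptions, showing only the general idea: walk a chain of facts from a seed entity, anchor the answer to the final fact, and replace named entities with descriptive clues so the question cannot be solved by a single lookup.

```python
# Hypothetical sketch of fact-anchored multi-hop QA synthesis.
# Toy "web graph": entity -> list of (relation, neighbor) facts.
WEB_GRAPH = {
    "Marie Curie": [("born_in", "Warsaw")],
    "Warsaw": [("capital_of", "Poland")],
}

# Descriptive clues used to obfuscate entity mentions (entity confusion),
# forcing multi-hop reasoning instead of a direct lookup.
CLUES = {
    "Marie Curie": "the first person to win Nobel Prizes in two sciences",
    "Warsaw": "the city where she was born",
}

def expand_chain(graph, seed, hops):
    """Topological expansion: walk up to `hops` facts outward from the seed."""
    chain, node = [], seed
    for _ in range(hops):
        if not graph.get(node):
            break
        relation, neighbor = graph[node][0]  # deterministic: take the first fact
        chain.append((node, relation, neighbor))
        node = neighbor
    return chain

def synthesize_qa(graph, clues, seed, hops):
    """Build one QA pair: the answer is anchored to the chain's final fact."""
    chain = expand_chain(graph, seed, hops)
    if not chain:
        return None
    answer = chain[-1][2]
    hops_text = "; ".join(f"{clues.get(s, s)} --{r}-->" for s, r, _ in chain)
    question = f"Starting from {clues[seed]}, follow: {hops_text} which entity do you reach?"
    return {"question": question, "answer": answer, "supporting_chain": chain}

qa = synthesize_qa(WEB_GRAPH, CLUES, "Marie Curie", hops=2)
print(qa["answer"])  # 2-hop endpoint of the fact chain: "Poland"
```

Because every question is derived from an explicit fact chain, the gold answer is known by construction, which is what makes the synthesis controllable in both coverage (choice of seeds) and complexity (number of hops).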
Problem

Research questions and friction points this paper is trying to address.

search agents
training data scarcity
open-source
Large Language Models
democratization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fact-grounded QA synthesis
Denoised trajectory synthesis
Multi-hop reasoning
Open-source search agent
Topological expansion