BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent

📅 2025-08-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing deep-research agent benchmarks (e.g., BrowseComp) suffer from fairness and transparency deficiencies: they rely on opaque, dynamic web search APIs, undermining reproducibility, and lack controlled document corpora, making it hard to disentangle retriever and LLM contributions. To address these issues, the authors propose BrowseComp-Plus, the first open-source evaluation benchmark built on a fixed, manually curated document corpus. It integrates reproducible retrievers, including BM25 and Qwen3-Embedding-8B, and provides human-verified supporting documents alongside mined hard negatives, enabling precise decoupling of retrieval and reasoning capabilities. Experiments show that BrowseComp-Plus effectively discriminates between systems: GPT-5 paired with Qwen3-Embedding-8B reaches 70.1% accuracy while substantially reducing search API calls, outperforming leading open-source alternatives.

📝 Abstract
Deep-Research agents, which integrate large language models (LLMs) with search tools, have shown success in improving the effectiveness of handling complex queries that require iterative search planning and reasoning over search results. However, current benchmarks like BrowseComp rely on black-box live web search APIs and have notable limitations in (1) fairness: dynamic and opaque web APIs hinder fair comparisons and reproducibility of deep research methods; and (2) transparency: lack of control over the document corpus makes it difficult to isolate retriever contributions. In other words, current evaluations may compare a complete deep research system at a given point in time, but they do not foster well-controlled experiments that provide insights into the capabilities of the underlying deep research LLMs. To address these challenges, we introduce BrowseComp-Plus, a benchmark derived from BrowseComp that employs a fixed, carefully curated corpus. Each query in BrowseComp-Plus includes human-verified supporting documents and mined challenging negatives, enabling controlled experimentation. The benchmark is shown to be effective in distinguishing the performance of deep research systems. For instance, the open-source model Search-R1, when paired with the BM25 retriever, achieves 3.86% accuracy, whereas GPT-5 achieves 55.9%. Integrating GPT-5 with the Qwen3-Embedding-8B retriever further improves its accuracy to 70.1% with fewer search calls. This benchmark enables comprehensive evaluation and disentangled analysis of deep research agents and retrieval methods, fostering insights into retrieval effectiveness, citation accuracy, and context engineering in Deep-Research systems.
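The abstract contrasts a sparse BM25 retriever with the dense Qwen3-Embedding-8B over the same fixed corpus. As a rough illustration of the sparse side, here is a minimal, self-contained BM25 scorer over a toy corpus; the corpus, tokenization, and parameter values are illustrative, not the benchmark's actual setup:

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every document in a fixed corpus against a query with BM25."""
    N = len(corpus_tokens)
    avgdl = sum(len(d) for d in corpus_tokens) / N
    # Document frequency: how many documents contain each term.
    df = Counter()
    for doc in corpus_tokens:
        for term in set(doc):
            df[term] += 1
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        s = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            s += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(s)
    return scores

# Toy fixed corpus standing in for the benchmark's curated documents.
corpus = [
    "deep research agents combine llms with search tools".split(),
    "bm25 is a classic sparse retrieval baseline".split(),
    "dense embedding retrievers encode queries and documents".split(),
]
query = "sparse retrieval bm25".split()
scores = bm25_scores(query, corpus)
best = max(range(len(corpus)), key=lambda i: scores[i])
```

Because the corpus is fixed, a run like this is exactly reproducible, which is the fairness property the benchmark argues live web APIs cannot provide.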
Problem

Research questions and friction points this paper is trying to address.

Fair comparison hindered by dynamic web APIs
Lack of transparency in document corpus control
Inability to isolate retriever contributions effectively
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fixed curated corpus for fair comparisons
Human-verified documents and mined negatives
Disentangled analysis of retrieval methods
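The disentangling described above amounts to holding the agent fixed while swapping retrievers over the same fixed corpus, then comparing end-task accuracy. A toy sketch of that experimental grid follows; all function names and stubs here are hypothetical placeholders, not the paper's code:

```python
def evaluate(agent, retriever, queries, gold):
    """Fraction of queries the agent answers correctly with a given retriever."""
    correct = 0
    for q in queries:
        docs = retriever(q)          # retrieval step, swappable per experiment
        answer = agent(q, docs)      # reasoning step, held fixed
        if answer == gold[q]:
            correct += 1
    return correct / len(queries)

# Stubs standing in for e.g. BM25 vs. Qwen3-Embedding-8B and a deep-research LLM.
queries = ["q1", "q2"]
gold = {"q1": "a1", "q2": "a2"}
retriever_good = lambda q: [gold[q]]        # always surfaces the supporting doc
retriever_bad = lambda q: ["distractor"]    # surfaces only a mined hard negative
agent = lambda q, docs: docs[0] if docs[0] in gold.values() else "unknown"

acc_good = evaluate(agent, retriever_good, queries, gold)
acc_bad = evaluate(agent, retriever_bad, queries, gold)
```

With the agent held constant, any gap between the two accuracies is attributable to retrieval, which is the controlled comparison a fixed corpus with verified positives and hard negatives makes possible.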
Zijian Chen
Shanghai Jiao Tong University | Shanghai AI Laboratory
Image/Video Quality Assessment, Large Multi-modal Models
Xueguang Ma
University of Waterloo
information retrieval, natural language processing
Shengyao Zhuang
Amazon, AGI
Information Retrieval, NLP
Ping Nie
University of Waterloo
Natural Language Processing, Information Retrieval, Recommendation Systems, Time Series Forecasting
Kai Zou
Founder & CEO, ProtagoLabs, NetMind.ai, and AGI odyssey
Artificial General Intelligence
Andrew Liu
University of Waterloo
Joshua Green
University of Waterloo
Kshama Patel
University of Waterloo
Ruoxi Meng
University of Waterloo
Mingyi Su
University of Waterloo
Sahel Sharifymoghaddam
University of Waterloo
Natural Language Processing, Information Retrieval
Yanxi Li
University of Sydney
Deep Learning, Computer Vision, Vision Transformer, Adversarial Robustness, Generative Modeling
Haoran Hong
University of Waterloo
Xinyu Shi
University of Waterloo
Xuye Liu
University of Waterloo
Natural Language Processing, LLM, Human-AI Collaboration, Multi-modal Learning
Nandan Thakur
PhD Student, University of Waterloo
information retrieval, natural language processing, deep learning, machine learning
Crystina Zhang
University of Waterloo
Information Retrieval, Natural Language Processing
Luyu Gao
Carnegie Mellon University
Information Retrieval, Natural Language Processing
Wenhu Chen
Assistant Professor at University of Waterloo
Natural Language Processing, Artificial Intelligence, Deep Learning
Jimmy Lin
University of Waterloo
information retrieval, natural language processing, data management, big data