🤖 AI Summary
Problem: Open-source Deep Research Agents (DRAs) are scarce and underperforming, and the field lacks standardized, reproducible benchmarks for evaluating them. Method: This work introduces (1) BrowseComp-Small (BC-Small), a lightweight, reproducible subset of the BrowseComp benchmark intended to fill the open-evaluation gap; (2) three enhancements to the open-source DRA Open Deep Research (ODR): improved task decomposition, integration of retrieval-augmented generation (RAG), and fine-tuned web information extraction; and (3) ablation studies to assess each component’s contribution. Results: On the 60-question BC-Small test set, the original ODR and two proprietary systems (one from Anthropic, one from Google) all achieve a 0% success rate, whereas the improved ODR+ attains a 10% success rate, the best result among the open- and closed-source systems evaluated. The benchmark and ODR+ together support open, reproducible, and empirically grounded progress on autonomous web-based research agents.
📝 Abstract
We focus here on Deep Research Agents (DRAs): systems that take a natural language prompt from a user and then autonomously search for and use internet-based content to address the prompt. Recent DRAs have demonstrated impressive capabilities on public benchmarks; however, recent research largely involves proprietary, closed-source systems. At the time of this work, we found only one open-source DRA, termed Open Deep Research (ODR). In this work we adapt the challenging recent BrowseComp benchmark to compare ODR to existing proprietary systems. We propose BrowseComp-Small (BC-Small), a subset of BrowseComp, as a more computationally tractable DRA benchmark for academic labs. We benchmark ODR and two proprietary systems on BC-Small: one system from Anthropic and one system from Google. We find that all three systems achieve 0% accuracy on the test set of 60 questions. We introduce three strategic improvements to ODR, resulting in the ODR+ model, which achieves a state-of-the-art 10% success rate on BC-Small among both closed-source and open-source systems. We report ablation studies indicating that all three of our improvements contribute to the success of ODR+.