DRBENCHER: Can Your Agent Identify the Entity, Retrieve Its Properties and Do the Math?

📅 2026-04-10

🤖 AI Summary
Existing benchmarks evaluate web browsing and multi-step computation in isolation, so they do not measure how well agents combine the two on realistic tasks. This work proposes an answer-first framework for generating synthetic benchmarks: it automatically constructs multi-hop questions across five domains, each requiring entity identification, property retrieval, and domain-specific computation. The framework enforces four criteria (verifiability, complexity, difficulty, and diversity), using a two-stage verification cascade to filter out questions the generating model can solve and a greedy max-min embedding filter to maximize coverage. Human evaluation finds 76% of questions valid (84% excluding stale data), while the strongest frontier model reaches only 20% answer accuracy. The generated benchmark shows higher semantic diversity than manually constructed benchmarks such as BrowseComp+, MATH-500, and GPQA.
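
The summary names a greedy max-min embedding filter but gives no details, so the following is only a minimal sketch of the general technique (farthest-point selection), not the paper's implementation. The function names, the choice of cosine distance, the fixed seed index, and the toy 2-D vectors are all assumptions made for illustration.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def greedy_max_min(embeddings, k, seed_index=0):
    """Select up to k indices so that each new pick is the candidate whose
    distance to its nearest already-selected item is largest.

    Hypothetical helper: the seed choice and distance metric are assumptions.
    """
    if not embeddings or k <= 0:
        return []
    selected = [seed_index]
    # min_dist[i] holds the distance from item i to its nearest selected item.
    min_dist = [cosine_distance(e, embeddings[seed_index]) for e in embeddings]
    while len(selected) < min(k, len(embeddings)):
        candidates = [i for i in range(len(embeddings)) if i not in selected]
        best = max(candidates, key=lambda i: min_dist[i])
        selected.append(best)
        # Refresh nearest-selected distances with the newly added item.
        for i, e in enumerate(embeddings):
            min_dist[i] = min(min_dist[i], cosine_distance(e, embeddings[best]))
    return selected


# Toy question embeddings (made up): items 0 and 1 are near-duplicates.
toy_embeddings = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [-1.0, 0.0]]
kept = greedy_max_min(toy_embeddings, k=3)
print(kept)
```

In this toy run the near-duplicate of the seed is the item left out, which is the behavior a diversity filter of this kind is meant to produce.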

📝 Abstract
Deep research agents increasingly interleave web browsing with multi-step computation, yet existing benchmarks evaluate these capabilities in isolation, creating a blind spot in assessing real-world performance. We introduce DRBENCHER, a synthetic benchmark generator for questions that require both browsing and computation. It enforces four criteria: verifiability (gold answers are computed by executing parameterized code over knowledge-graph values), complexity (multi-hop entity identification, property retrieval, and domain-specific computation), difficulty (a two-stage verification cascade filters out questions solvable by the generating model), and diversity (a greedy max-min embedding filter maximizes coverage). These criteria are realized via a unified answer-first pipeline spanning five domains: biochemistry, financial, geophysical, security, and history. Human evaluation shows 76% validity (84% excluding stale data), with 35% of errors due to outdated knowledge-graph entries, highlighting an inherent limitation of systems that reason over evolving data. Automatic evaluation shows that the strongest frontier model achieves only 20% answer accuracy. Compared to manually constructed benchmarks (BrowseComp+, MATH-500, GPQA), DRBENCHER achieves the highest semantic diversity.
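
The verifiability criterion says gold answers are computed by executing parameterized code over knowledge-graph values. A minimal sketch of that answer-first idea might look like the following. Everything concrete here is hypothetical: the `TOY_KG` dictionary, the entity names, the numbers, the `net_margin_pct` template, and the `build_item` helper are invented for illustration and do not come from the paper.

```python
# Hypothetical knowledge-graph slice: entity -> property -> value.
# All names and numbers are invented for this sketch.
TOY_KG = {
    "CompanyA": {"revenue": 120.0, "net_income": 18.0},
    "CompanyB": {"revenue": 80.0, "net_income": 4.0},
}


def net_margin_pct(revenue, net_income):
    """Example parameterized computation: net margin as a percentage."""
    return round(100.0 * net_income / revenue, 2)


def build_item(kg, entity, template, properties):
    """Answer-first construction: fetch the property values for an entity,
    execute the template on them, and record the result as the gold answer.
    The question text would be written afterwards around these inputs.
    """
    values = {prop: kg[entity][prop] for prop in properties}
    gold = template(**values)
    return {"entity": entity, "inputs": values, "gold_answer": gold}


item = build_item(TOY_KG, "CompanyA", net_margin_pct, ["revenue", "net_income"])
print(item)
```

Because the gold answer is recomputed from stored values, it is only as current as the knowledge graph itself, which is consistent with the abstract's note that outdated entries account for a share of the invalid questions.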

Problem

Research questions and friction points this paper is trying to address.

deep research agents
web browsing
multi-step computation
benchmarking
knowledge-graph reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

synthetic benchmark
answer-first pipeline
multi-hop reasoning
knowledge-graph computation
semantic diversity