DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation

📅 2026-05-20

🤖 AI Summary

Existing benchmarks do not separate frontier language models on deep research tasks, because they do not systematically test large-scale evidence retrieval, cross-source integration, and long-horizon reasoning. This work introduces DeepWeb-Bench, a high-difficulty deep research benchmark whose results are reported across four capability families: retrieval, derivation, reasoning, and calibration. To improve auditability, each reference answer carries a source-provenance record with four disclosure levels, plus cross-source checks where available. Experiments on nine frontier models show that derivation and calibration failures account for over 70% of errors, while retrieval failures account for only 12-14%, and that strong and weak models fail in different ways: incomplete derivation dominates for strong models and hallucinated precision for weak ones. Cross-model agreement is only moderate (ρ = 0.61), which the authors read as genuine specialization across domains.

📝 Abstract

Deep research, in which an agent searches the open web, collects evidence, and derives an answer through extended reasoning, is a prominent use case for frontier language models. Frontier deep research products score high on existing benchmarks, making it difficult to distinguish their capabilities from current evaluation data alone. We introduce DeepWeb-Bench, a deep research benchmark that is substantially harder than existing benchmarks for the current frontier. Difficulty comes from three properties of the data itself: each task requires massive evidence collection, cross-source reconciliation, and long-horizon multi-step derivation. We represent these three sources of difficulty as four capability families (Retrieval, Derivation, Reasoning, and Calibration) and report results sliced by family. Every reference answer is accompanied by a source-provenance record with four disclosure levels and cross-source checks where available, making scores easier to audit against the underlying evidence. We evaluate nine frontier models on DeepWeb-Bench and report three findings: (1) retrieval is not the bottleneck, as retrieval failures account for only 12-14% of errors while derivation and calibration failures account for over 70%; (2) strong and weak models fail in qualitatively different ways, with strong models' errors dominated by incomplete derivation and weak models' by hallucinated precision; and (3) models exhibit genuine specialization across domains, with cross-model agreement of only ρ = 0.61 and per-case disagreement reaching 18.8 percentage points. The public benchmark release includes the data, rubrics, and evaluation code.
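
To make the agreement figures in finding (3) concrete, the sketch below computes a rank correlation and a largest per-case gap between two models' scores. It rests on assumptions the abstract does not confirm: that the reported coefficient is a Spearman rank correlation, and that per-case scores are fractions in [0, 1]. The function names, the two model score lists, and their values are hypothetical toy data, not results from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window over every value tied with the one at `start`.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-case scores for two models (toy values, not from the paper).
model_a = [0.62, 0.40, 0.75, 0.31, 0.58]
model_b = [0.55, 0.47, 0.60, 0.42, 0.70]

rho = spearman(model_a, model_b)
# Largest absolute per-case difference, expressed in percentage points.
max_gap_pp = max(abs(a - b) for a, b in zip(model_a, model_b)) * 100

print(f"rho = {rho:.2f}, max per-case gap = {max_gap_pp:.1f} pp")
```

A coefficient well below 1 means the two models rank the cases differently, which is the pattern the abstract interprets as specialization. With more than two models, one plausible reading is an average over model pairs, but the abstract does not say how the single value was aggregated.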

Problem

Research questions and friction points this paper is trying to address.

deep research

cross-source evidence

long-horizon derivation

language model evaluation

reasoning benchmark

Innovation

Methods, ideas, or system contributions that make the work stand out.

DeepWeb-Bench

cross-source reconciliation

long-horizon derivation