DeepFact: Co-Evolving Benchmarks and Agents for Deep Research Factuality

📅 2026-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses factuality verification in deep research reports (DRRs), where claim-level fact-checking is hindered by the limitations of existing fact-checking systems and static benchmarks. To overcome this, the authors propose an Audit-then-Score (AtS) framework with a dynamic, iteratively refined benchmark: human experts act as high-precision arbitrators of disputed labels rather than one-time annotators, enabling co-evolution between verification agents and the benchmark itself. Built on this framework, the versioned benchmark DeepFact-Bench and the document-level verifier DeepFact-Eval (along with a lightweight variant) yield substantial improvements: across four AtS rounds, expert accuracy on a hidden micro-gold standard rises from 60.8% to 90.9%, and DeepFact-Eval consistently outperforms state-of-the-art methods on both internal and external datasets, demonstrating strong transferability.
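
For concreteness, a versioned benchmark entry of the kind the summary describes can be pictured as a claim record whose label and rationale carry a revision history. The sketch below is hypothetical (names like `ClaimRecord` and `Revision` are illustrative, not from the paper); it only shows what "revisable labels with auditable rationales" might look like as a data structure.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical sketch of a versioned, auditable benchmark entry.
# All names (ClaimRecord, Revision, ...) are illustrative, not from the paper.

@dataclass
class Revision:
    round_id: int        # AtS round in which this revision was accepted
    label: str           # e.g. "supported" / "refuted" / "unverifiable"
    rationale: str       # auditor's written justification for the change
    evidence: List[str]  # evidence submitted by the disputing verifier

@dataclass
class ClaimRecord:
    claim: str
    history: List[Revision] = field(default_factory=list)

    @property
    def current_label(self) -> str:
        """The benchmark label is whatever the latest accepted revision says."""
        return self.history[-1].label if self.history else "unlabeled"

    def accept_revision(self, rev: Revision) -> None:
        """Append an auditor-accepted revision; earlier versions stay auditable."""
        self.history.append(rev)
```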

📝 Abstract
Search-augmented LLM agents can produce deep research reports (DRRs), but verifying claim-level factuality remains challenging. Existing fact-checkers are primarily designed for general-domain, factoid-style atomic claims, and there is no benchmark to test whether such verifiers transfer to DRRs. Yet building such a benchmark is itself difficult. We first show that static expert-labeled benchmarks are brittle in this setting: in a controlled study with PhD-level specialists, unassisted experts achieve only 60.8% accuracy on a hidden micro-gold set of verifiable claims. We propose Evolving Benchmarking via Audit-then-Score (AtS), where benchmark labels and rationales are explicitly revisable: when a verifier disagrees with the current benchmark, it must submit evidence; an auditor adjudicates the dispute; and accepted revisions update the benchmark before models are scored. Across four AtS rounds, expert micro-gold accuracy rises to 90.9%, indicating experts are substantially more reliable as auditors than as one-shot labelers. We instantiate AtS as DeepFact-Bench, a versioned DRR factuality benchmark with auditable rationales, and DeepFact-Eval, a document-level verification agent (with a grouped lite variant) that outperforms existing verifiers on DeepFact-Bench and transfers well to external factuality datasets.
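
The dispute-audit-score ordering in the abstract can be rendered as a short loop. This is a minimal sketch of one AtS round, assuming the `ClaimRecord` structure sketched earlier and hypothetical `verifier.predict` / `auditor.adjudicate` interfaces; none of these names come from the paper.

```python
# Minimal sketch of one Audit-then-Score (AtS) round. The verifier and
# auditor interfaces are assumed for illustration, not the paper's API.

def ats_round(benchmark, verifier, auditor):
    """Audit disputed labels first, then score against the updated benchmark."""
    preds = {id(r): verifier.predict(r.claim) for r in benchmark}

    # Audit phase: a disagreement with the current benchmark must come with
    # evidence, which an expert auditor adjudicates before any scoring.
    for record in benchmark:
        pred = preds[id(record)]
        if pred.label != record.current_label:
            verdict = auditor.adjudicate(record, pred.evidence)
            if verdict.revise_benchmark:  # the verifier's evidence prevails
                record.accept_revision(verdict.revision)

    # Score phase: accuracy is measured against the post-audit labels, so a
    # well-evidenced dispute improves the benchmark instead of costing points.
    correct = sum(preds[id(r)].label == r.current_label for r in benchmark)
    return correct / len(benchmark)
```

The key design point is the ordering: accepted revisions update the benchmark before models are scored, which is what lets the benchmark and the verification agents co-evolve across rounds.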
Problem

Research questions and friction points this paper is trying to address.

factuality
deep research reports
benchmarking
claim verification
LLM agents
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evolving Benchmarking
Audit-then-Score
Deep Research Reports
Factuality Verification
Search-Augmented LLM Agents