🤖 AI Summary
This work addresses factuality verification for deep research reports (DRRs), where claim-level fact-checking is hindered by the limitations of existing fact-checking systems and the brittleness of static benchmarks. The authors propose an Audit-then-Score (AtS) framework that makes benchmark labels dynamic and iteratively revisable, turning human experts from one-time annotators into high-precision arbitrators and enabling co-evolution between verification agents and the benchmark itself. Built on this framework, the versioned benchmark DeepFact-Bench and the document-level verifier DeepFact-Eval (along with a lightweight variant) deliver substantial gains: expert accuracy on a micro-gold standard rises from 60.8% to 90.9%, and DeepFact-Eval consistently outperforms state-of-the-art methods on both internal and external datasets, showing strong transferability.
📝 Abstract
Search-augmented LLM agents can produce deep research reports (DRRs), but verifying claim-level factuality remains challenging. Existing fact-checkers are primarily designed for general-domain, factoid-style atomic claims, and there is no benchmark to test whether such verifiers transfer to DRRs. Yet building such a benchmark is itself difficult. We first show that static expert-labeled benchmarks are brittle in this setting: in a controlled study with PhD-level specialists, unassisted experts achieve only 60.8% accuracy on a hidden micro-gold set of verifiable claims. We propose Evolving Benchmarking via Audit-then-Score (AtS), where benchmark labels and rationales are explicitly revisable: when a verifier disagrees with the current benchmark, it must submit evidence; an auditor adjudicates the dispute; and accepted revisions update the benchmark before models are scored. Across four AtS rounds, expert micro-gold accuracy rises to 90.9%, indicating experts are substantially more reliable as auditors than as one-shot labelers. We instantiate AtS as DeepFact-Bench, a versioned DRR factuality benchmark with auditable rationales, and DeepFact-Eval, a document-level verification agent (with a grouped lite variant) that outperforms existing verifiers on DeepFact-Bench and transfers well to external factuality datasets.
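The AtS loop described in the abstract (verifier disputes a label, submits evidence, an auditor adjudicates, accepted revisions update the benchmark before scoring) can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the `Claim` fields, the `auditor` callable, and the version counter are hypothetical names chosen for clarity.

```python
from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    label: str       # current benchmark label, e.g. "supported" / "refuted"
    rationale: str   # auditable rationale attached to the label
    version: int = 1 # bumped whenever an accepted revision updates the claim

def audit_then_score(benchmark, predictions, auditor):
    """One hypothetical AtS round: adjudicate disputes, revise the
    benchmark, then score the verifier against the revised labels."""
    correct = 0
    for claim, (pred_label, evidence) in zip(benchmark, predictions):
        if pred_label != claim.label:
            # Verifier disagrees with the current benchmark label:
            # it must submit evidence, and the auditor adjudicates.
            verdict = auditor(claim, pred_label, evidence)
            if verdict == pred_label:
                # Accepted revision: update the benchmark in place.
                claim.label = pred_label
                claim.rationale = evidence
                claim.version += 1
        if pred_label == claim.label:
            correct += 1
    return correct / len(benchmark)
```

A trivial auditor that always sides with the evidence-backed prediction (`lambda c, p, e: p`) would accept every dispute; in the paper's setting the auditor is a human expert, which the authors find far more reliable than one-shot labeling.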