Total Recall QA: A Verifiable Evaluation Suite for Deep Research Agents

📅 2026-03-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of comprehensive, verifiable, and scalable benchmarks for evaluating deep research agents. To this end, we introduce the Total Recall QA (TRQA) task, which adapts the TREC total recall paradigm to open-domain question answering evaluation for the first time. TRQA establishes a verifiable framework that integrates structured knowledge bases (Wikidata–Wikipedia) with synthetic e-commerce corpora. The benchmark features a single-answer query set with precise relevance annotations, enabling unified evaluation of both retrieval-based and end-to-end models. Designed with robustness against data contamination and extensibility in mind, TRQA supports reproducible and standardized assessment of research agents. We release the TRQA dataset along with performance results from multiple baseline models to foster future progress in this area.

📝 Abstract
Deep research agents have emerged as LLM-based systems designed to perform multi-step information seeking and reasoning over large, open-domain sources, answering complex questions by synthesizing evidence from multiple sources. Given the complexity of the task, and despite various recent efforts, evaluation of deep research agents remains fundamentally challenging. This paper identifies a list of required and optional properties for evaluating deep research agents and observes that no existing benchmark satisfies all identified requirements. Inspired by prior research on the TREC Total Recall Tracks, we introduce the task of Total Recall Question Answering and develop a framework for deep research agent evaluation that satisfies the identified criteria. Our framework constructs single-answer, total-recall queries with precise evaluation and relevance judgments derived from a structured knowledge base paired with a text corpus, enabling large-scale data construction. Using this framework, we build TRQA, a deep research benchmark constructed from Wikidata-Wikipedia as a real-world source and from a synthetically generated e-commerce knowledge base and corpus that mitigates the effects of data contamination. We benchmark the collection with representative retriever and deep research models and establish baseline retrieval and end-to-end results for future comparative evaluation.
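The evaluation scheme the abstract describes — scoring a system's retrieved documents against precise, per-query relevance judgments — can be sketched roughly as follows. This is an illustrative outline only, assuming a simple set-based scoring convention; the function and variable names are hypothetical and not part of the TRQA release.

```python
# Hypothetical total-recall scoring: compare each query's retrieved
# document ids against gold relevance judgments and report per-query
# recall and precision. All names here are illustrative.

def evaluate_total_recall(gold, retrieved):
    """gold: {query_id: set of relevant doc ids}
    retrieved: {query_id: list of doc ids a system returned}"""
    scores = {}
    for qid, relevant in gold.items():
        returned = retrieved.get(qid, [])
        hits = set(returned) & relevant  # correctly retrieved docs
        recall = len(hits) / len(relevant) if relevant else 0.0
        precision = len(hits) / len(returned) if returned else 0.0
        scores[qid] = {"recall": recall, "precision": precision}
    return scores


# Example: one query with two relevant docs; the system found one of them.
gold = {"q1": {"d1", "d2"}}
retrieved = {"q1": ["d1", "d3"]}
print(evaluate_total_recall(gold, retrieved))
# → {'q1': {'recall': 0.5, 'precision': 0.5}}
```

Because the queries are single-answer with exhaustive judgments, recall over the relevant set is directly verifiable, which is what distinguishes this setup from benchmarks scored by LLM judges or fuzzy answer matching.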
Problem

Research questions and friction points this paper is trying to address.

deep research agents
evaluation benchmark
Total Recall QA
information synthesis
verifiable evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Total Recall QA
deep research agents
verifiable evaluation
knowledge-base grounded benchmark
data contamination mitigation