ResearchRubrics: A Benchmark of Prompts and Rubrics For Evaluating Deep Research Agents

📅 2025-11-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing evaluation frameworks lack reliability for assessing deep research agents' capabilities, such as multi-step inference and cross-document synthesis, on open-ended queries. Method: The paper introduces ResearchRubrics, a standardized benchmark comprising (i) a task complexity framework with three axes (conceptual breadth, logical nesting, and exploration); (ii) fine-grained, expert-written rubrics assessing factual grounding, reasoning soundness, and clarity; and (iii) human and model-based evaluation protocols that measure rubric adherence. Contribution/Results: The authors release the benchmark, built with over 2,800 hours of human labor, with 2,500+ rubrics, all prompts, and evaluation code. Empirical evaluation finds that even leading systems such as Gemini's DR and OpenAI's DR achieve under 68% average rubric compliance, primarily due to missed implicit context and inadequate reasoning over retrieved information. The results highlight the need for robust, scalable assessment of deep research capabilities in retrieval-augmented and agentic systems.

📝 Abstract
Deep Research (DR) is an emerging agent application that leverages large language models (LLMs) to address open-ended queries. It requires the integration of several capabilities, including multi-step reasoning, cross-document synthesis, and the generation of evidence-backed, long-form answers. Evaluating DR remains challenging because responses are lengthy and diverse, admit many valid solutions, and often depend on dynamic information sources. We introduce ResearchRubrics, a standardized benchmark for DR built with over 2,800 hours of human labor that pairs realistic, domain-diverse prompts with 2,500+ expert-written, fine-grained rubrics to assess factual grounding, reasoning soundness, and clarity. We also propose a new complexity framework for categorizing DR tasks along three axes: conceptual breadth, logical nesting, and exploration. In addition, we develop human and model-based evaluation protocols that measure rubric adherence for DR agents. We evaluate several state-of-the-art DR systems and find that even leading agents like Gemini's DR and OpenAI's DR achieve under 68% average compliance with our rubrics, primarily due to missed implicit context and inadequate reasoning about retrieved information. Our results highlight the need for robust, scalable assessment of deep research capabilities, to which end we release ResearchRubrics (including all prompts, rubrics, and evaluation code) to facilitate progress toward well-justified research assistants.
Problem

Research questions and friction points this paper is trying to address.

Evaluating deep research agents whose responses are lengthy and diverse
Assessing factual grounding, reasoning soundness, and clarity in research outputs
Measuring rubric compliance on multi-step, cross-document synthesis tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Standardized benchmark with expert-written rubrics
Complexity framework categorizing tasks along three axes
Human and model-based evaluation protocols for agents
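The rubric-adherence measurement behind these contributions can be sketched roughly as follows. This is an illustrative assumption, not the paper's actual implementation: each prompt's rubric is modeled as a checklist of binary criteria, each judged satisfied or not by a human or LLM grader, and compliance is the fraction satisfied, averaged across prompts.

```python
from dataclasses import dataclass

@dataclass
class RubricCriterion:
    """One fine-grained rubric item; `satisfied` comes from a human or LLM judge."""
    description: str
    satisfied: bool

@dataclass
class RubricResult:
    """All criterion judgments for one benchmark prompt."""
    prompt_id: str
    criteria: list

    def compliance(self) -> float:
        # Fraction of rubric criteria the agent's answer satisfied.
        if not self.criteria:
            return 0.0
        return sum(c.satisfied for c in self.criteria) / len(self.criteria)

def average_compliance(results) -> float:
    """Mean per-prompt rubric compliance across a benchmark run."""
    return sum(r.compliance() for r in results) / len(results)

# Toy run: one prompt half-satisfied, one fully satisfied.
run = [
    RubricResult("p1", [RubricCriterion("cites retrieved sources", True),
                        RubricCriterion("addresses implicit context", False)]),
    RubricResult("p2", [RubricCriterion("reasoning is sound", True)]),
]
print(average_compliance(run))  # 0.75
```

Under this reading, the reported "under 68% average compliance" is this per-prompt fraction averaged over the benchmark; weighting schemes across criteria or prompts are a design choice not specified here.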