DEER: A Comprehensive and Reliable Benchmark for Deep-Research Expert Reports

📅 2025-12-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing evaluation benchmarks for deep-research-oriented large language models (LLMs) lack systematicity and interpretability, particularly for expert-level report generation. Method: This paper introduces DEER, the first multidimensional evaluation benchmark tailored for expert-grade research reports, covering 50 tasks across 13 domains. It establishes a domain-expert-driven assessment framework with seven primary dimensions and 25 sub-dimensions, operationalized as 130 fine-grained criteria, and supports full-report-level factual verification, including uncited claims. DEER integrates (i) expert-grounded multidimensional evaluation, (ii) task-specific expert guidance mechanisms, (iii) domain-agnostic claim extraction with external-evidence quality quantification, and (iv) expert-knowledge-enhanced LLM-based judgment. Results: Experiments show strong agreement between DEER and human expert judgments (Spearman ρ > 0.92), enabling precise identification of model deficiencies in authority, reasoning coherence, and evidence coverage, and improving evaluation reliability, interpretability, and diagnostic utility.
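
To make the agreement claim concrete, here is a minimal sketch of the kind of rank-correlation check behind a reported Spearman ρ. It assumes paired per-report scores from the automatic judge and from human experts; the numbers shown are illustrative placeholders, not values from the paper.

```python
# Minimal sketch: judge-vs-expert agreement via Spearman's rank correlation.
# Assumes one automatic score and one expert score per report.
from scipy.stats import spearmanr

deer_scores   = [7.2, 5.8, 8.4, 6.1, 9.0, 4.7]  # hypothetical automatic scores
expert_scores = [7.0, 6.0, 8.8, 5.9, 9.2, 4.5]  # hypothetical human ratings

rho, p_value = spearmanr(deer_scores, expert_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.4f})")
# The paper reports rho > 0.92 between DEER and human expert judgments.
```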

📝 Abstract
As large language models (LLMs) advance, deep research systems can generate expert-level reports via multi-step reasoning and evidence-based synthesis, but evaluating such reports remains challenging. Existing benchmarks often lack systematic criteria for expert reporting, evaluations that rely heavily on LLM judges can fail to capture issues that require expert judgment, and source verification typically covers only a limited subset of explicitly cited statements rather than report-wide factual reliability. We introduce DEER, a benchmark for evaluating expert-level deep research reports. DEER comprises 50 report-writing tasks spanning 13 domains and an expert-grounded evaluation taxonomy (7 dimensions, 25 sub-dimensions) operationalized into 130 fine-grained rubric items. DEER further provides task-specific expert guidance to help LLM judges assess expert-level report quality more consistently. Complementing rubric-based assessment, we propose a document-level fact-checking architecture that extracts and verifies all claims across the entire report, including both cited and uncited ones, and quantifies external-evidence quality. DEER correlates closely with human expert judgments and yields interpretable diagnostics of system strengths and weaknesses.
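
The fact-checking architecture is described only at a high level here, but its shape can be sketched. The following is a hypothetical skeleton assuming three pluggable components (a claim extractor, an evidence retriever, and a claim judge, each typically LLM-backed); none of the names below come from the paper.

```python
# Hypothetical skeleton of a document-level fact-checking pass:
# extract every claim (cited or not), fetch evidence, and judge support.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Claim:
    text: str
    has_citation: bool  # both cited and uncited claims are verified

@dataclass
class Verdict:
    claim: Claim
    supported: bool
    evidence_quality: float  # e.g., source authority/recency in [0, 1]

def fact_check_report(
    report: str,
    extract_claims: Callable[[str], List[Claim]],
    retrieve_evidence: Callable[[Claim], List[str]],
    judge_claim: Callable[[Claim, List[str]], Verdict],
) -> List[Verdict]:
    """Verify every claim in the report, not only the explicitly cited ones."""
    verdicts: List[Verdict] = []
    for claim in extract_claims(report):       # document-level: all claims
        evidence = retrieve_evidence(claim)    # external evidence lookup
        verdicts.append(judge_claim(claim, evidence))
    return verdicts

def support_rate(verdicts: List[Verdict]) -> float:
    """Fraction of claims backed by external evidence."""
    return sum(v.supported for v in verdicts) / max(len(verdicts), 1)
```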
Problem

Research questions and friction points this paper is trying to address.

How to evaluate expert-level deep research reports generated by LLMs
How to define systematic criteria via an expert-grounded evaluation taxonomy
How to verify factual reliability across an entire report, including uncited claims
Innovation

Methods, ideas, or system contributions that make the work stand out.

Expert-grounded evaluation taxonomy with fine-grained rubric items (see the aggregation sketch after this list)
Document-level fact-checking architecture verifying all report claims
Task-specific expert guidance for consistent LLM judge assessments
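
As a rough illustration of how 130 rubric items could roll up through 25 sub-dimensions into 7 dimension scores, here is a sketch assuming simple equal-weight averaging at each level; the paper's actual weighting scheme is not specified here, and the dimension names in the usage example are hypothetical.

```python
# Sketch of rubric-score aggregation with equal-weight averaging (an assumption).
from collections import defaultdict
from statistics import mean

# item_scores maps (dimension, sub_dimension, item_id) -> rubric score in [0, 1]
def aggregate(item_scores: dict[tuple[str, str, str], float]) -> dict[str, float]:
    # Roll rubric items up into sub-dimension scores.
    by_sub: dict[tuple[str, str], list[float]] = defaultdict(list)
    for (dim, sub, _item), score in item_scores.items():
        by_sub[(dim, sub)].append(score)

    # Roll sub-dimension scores up into dimension scores.
    by_dim: dict[str, list[float]] = defaultdict(list)
    for (dim, _sub), scores in by_sub.items():
        by_dim[dim].append(mean(scores))

    return {dim: mean(subs) for dim, subs in by_dim.items()}

# Usage with hypothetical dimension/sub-dimension names:
scores = aggregate({
    ("Evidence", "Coverage", "item_001"): 1.0,
    ("Evidence", "Coverage", "item_002"): 0.0,
    ("Reasoning", "Coherence", "item_050"): 1.0,
})
print(scores)  # {'Evidence': 0.5, 'Reasoning': 1.0}
```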