🤖 AI Summary
Existing deep research systems lack cross-domain, multidimensional evaluation benchmarks, with notable gaps in dimensions such as objectivity and citation quality. To address this limitation, this work introduces and open-sources DRACO, a benchmark of complex research tasks spanning 10 domains and drawing on information sources from 40 countries. The tasks are derived from real user queries that are anonymized and augmented to preserve authenticity while protecting privacy. DRACO establishes the first multidimensional evaluation framework tailored to authentic deep research scenarios, grading outputs against task-specific rubrics along four key axes: accuracy, completeness, objectivity, and citation quality. The benchmark provides a standardized, reproducible tool for evaluating model capabilities on complex research-oriented tasks.
📝 Abstract
We present DRACO (Deep Research Accuracy, Completeness, and Objectivity), a benchmark of complex deep research tasks. The tasks span 10 domains, draw on information sources from 40 countries, and originate from anonymized real-world usage of a large-scale deep research system. They are sampled from a de-identified dataset of Perplexity Deep Research requests, then filtered and augmented to ensure that they are anonymized, open-ended and complex, objectively evaluable, and representative of the broad scope of real-world deep research use cases. Outputs are graded against task-specific rubrics along four dimensions: factual accuracy (accuracy), breadth and depth of analysis (including completeness), presentation quality (including objectivity), and citation quality. DRACO is publicly available at https://hf.co/datasets/perplexity-ai/draco.
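For orientation, here is a minimal sketch of what an evaluation harness over DRACO might look like: load the dataset from Hugging Face, run a system over each task, and average rubric scores along the four dimensions. The split name, the `task` and `rubric` field names, and the stub grader are assumptions for illustration, not the released evaluation code; consult the dataset card for the actual schema.

```python
# Sketch of a DRACO evaluation loop. Field/split names are assumptions.
from datasets import load_dataset

DIMENSIONS = ("accuracy", "completeness", "objectivity", "citation_quality")

def grade_output(output: str, rubric: dict) -> dict[str, float]:
    """Stub grader: replace with a human rater or LLM judge that scores
    the output against each dimension's task-specific rubric criteria."""
    return {dim: 0.0 for dim in DIMENSIONS}

def evaluate(system, tasks) -> dict[str, float]:
    """Run a deep-research system over tasks and average per-dimension scores."""
    totals = {dim: 0.0 for dim in DIMENSIONS}
    for task in tasks:
        output = system(task["task"])                  # "task" field is assumed
        scores = grade_output(output, task["rubric"])  # "rubric" field is assumed
        for dim, score in scores.items():
            totals[dim] += score
    return {dim: total / len(tasks) for dim, total in totals.items()}

if __name__ == "__main__":
    draco = load_dataset("perplexity-ai/draco", split="test")  # split name assumed
    report = evaluate(lambda prompt: "…", draco)  # plug in a real system here
    print(report)
```

Per-dimension averages, rather than a single scalar, keep the accuracy, completeness, objectivity, and citation-quality axes separately comparable across systems.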