Rethinking Literature Search Evaluation: Deep Research Helps, and Human Citation Lists Are Not a Ground Truth

📅 2026-05-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two limitations of traditional literature-retrieval evaluation: it treats human citation lists as ground truth, and it relies on single-axis metrics that fail to capture retrieval quality comprehensively. The authors build a Deep Research retrieval pipeline that reads the full query paper and expands the retrieved results breadth-first along their bibliographies. Separately, they audit human reference lists with a neutral LLM-as-a-judge and the OpenAlex co-authorship graph. Compared with vanilla API-only search, the pipeline raises recall on RollingEval-Jun25, a 250-paper benchmark, from below 20% to above 80%. The audit finds that only 51% of human citations are judged moderately relevant or higher, against 86–88% for the strongest AI-based re-rankers, and that humans are 2.5x more likely than the best AI re-rankers to cite a direct collaborator. The study advocates jointly reporting recall, topical relevance, ranked-list diversity, and co-authorship distance to make retrieval evaluation fairer and more holistic.
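To make the co-authorship-distance idea concrete, here is a minimal sketch, not the paper's implementation. It assumes a toy co-authorship graph stored as a plain dictionary (the paper uses OpenAlex; no OpenAlex API is called here), and the function name `coauthor_distance` and the author names are hypothetical. Distance 0 means a shared author, distance 1 a direct collaborator.

```python
from collections import deque


def coauthor_distance(graph, query_authors, cited_authors):
    """Fewest co-authorship hops from any query author to any cited author.

    `graph` maps an author to the set of authors they have published with.
    Returns None when no path exists. This definition is an assumption made
    for illustration; the paper may define the diagnostic differently.
    """
    targets = set(cited_authors)
    seen = set(query_authors)
    queue = deque((author, 0) for author in query_authors)
    while queue:
        author, dist = queue.popleft()
        if author in targets:
            return dist
        for neighbor in graph.get(author, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


# Hypothetical toy graph: alice and bob co-authored, bob and carol co-authored.
toy_graph = {
    "alice": {"bob"},
    "bob": {"alice", "carol"},
    "carol": {"bob"},
    "dave": set(),
}

direct = coauthor_distance(toy_graph, ["alice"], ["bob"])      # direct collaborator
two_hops = coauthor_distance(toy_graph, ["alice"], ["carol"])  # collaborator of a collaborator
unrelated = coauthor_distance(toy_graph, ["alice"], ["dave"])  # no connecting path
```

Under this definition, a citation list dominated by distance-1 entries would point to the kind of collaboration bias the summary describes.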
📝 Abstract
We study large-scale literature search from two complementary angles: improving the retrieval pipeline, and stress-testing the human reference list as an evaluation target. First, we implement a Deep Research pipeline that processes the full query paper and expands the retrieved results breadth-first along their bibliographies, and show that it substantially outperforms vanilla API-only search, raising recall on RollingEval-Jun25 (a 250-paper literature-search benchmark) from below 20% to above 80%. Second, we use a neutral LLM-as-a-judge to determine if human references are sound ground truth for the task. We find significant limitations: only 51% of human citations are judged moderately relevant or higher, against 86–88% for the strongest AI-based re-rankers. We study this gap on the OpenAlex co-authorship graph, finding that humans are 2.5x more likely than the best AI re-rankers to cite a direct collaborator. Together, our results argue against single-axis literature-search evaluation: recall, topical-relevance scoring, ranked-list diversity, and a co-authorship-distance diagnostic each measure complementary properties of citation quality and should be reported jointly.
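The breadth-first bibliography expansion and the recall metric described above can be sketched as follows. This is an illustrative toy, not the authors' pipeline: the `bibliography` dictionary, the paper IDs, the `max_depth` parameter, and both function names are assumptions, and the toy numbers are not results from the paper.

```python
from collections import deque


def expand_breadth_first(seeds, bibliography, max_depth=2):
    """Collect papers reachable from `seeds` by following reference lists.

    `bibliography` maps a paper ID to the IDs it cites. Expansion stops
    `max_depth` hops away from the seeds.
    """
    seen = set(seeds)
    queue = deque((paper, 0) for paper in seeds)
    while queue:
        paper, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for cited in bibliography.get(paper, []):
            if cited not in seen:
                seen.add(cited)
                queue.append((cited, depth + 1))
    return seen


def recall(retrieved, reference_list):
    """Fraction of the reference list that appears among the retrieved papers."""
    reference = set(reference_list)
    if not reference:
        return 0.0
    return len(reference & set(retrieved)) / len(reference)


# Hypothetical toy data: A and B stand in for initial API search hits.
bibliography = {
    "A": ["C", "D"],
    "B": ["D", "E"],
    "C": ["F"],
    "E": ["G"],
}
seeds = ["A", "B"]
reference_list = ["C", "F", "G", "H"]  # stand-in for a human citation list

api_only_recall = recall(seeds, reference_list)
expanded = expand_breadth_first(seeds, bibliography, max_depth=2)
expanded_recall = recall(expanded, reference_list)
```

In this toy setup the seeds alone recover none of the reference list, while two hops of expansion recover three of its four entries. Because recall is measured against the human list, it inherits that list's weaknesses, which is why the abstract recommends reporting it alongside the other diagnostics.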
Problem

Research questions and friction points this paper is trying to address.

literature search evaluation
ground truth
citation relevance
human reference bias
retrieval benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

Deep Research
literature search evaluation
LLM-as-a-judge
citation relevance
co-authorship bias