AI Summary
This work addresses a limitation of current automated peer review systems: because they lack external scholarly context, they struggle to evaluate a paper's novelty and significance and to identify deeper methodological flaws. To overcome this, the authors propose a context-aware multi-agent framework that emulates expert cognitive processes through a dual-stream mechanism of context acquisition and active verification. The system integrates a historian agent that constructs a historical narrative of the field, a baseline reconnaissance agent that detects missing baseline comparisons, and a multifaceted QA engine that performs multi-dimensional question-answering verification. These components are augmented with real-time, large-scale literature retrieval to dynamically build domain-specific knowledge graphs and actively validate the paper's claims. Evaluated on the DeepReview-13K dataset, the approach significantly outperforms existing systems in pairwise assessments and substantially narrows the gap with human reviewers in terms of feedback diversity.
Abstract
Automated peer review has evolved from simple text classification to structured feedback generation. However, current state-of-the-art systems still struggle with "surface-level" critiques: they excel at summarizing content but often fail to accurately assess novelty and significance or identify deep methodological flaws because they evaluate papers in a vacuum, lacking the external context a human expert possesses. In this paper, we introduce ScholarPeer, a search-enabled multi-agent framework designed to emulate the cognitive processes of a senior researcher. ScholarPeer employs a dual-stream process of context acquisition and active verification. It dynamically constructs a domain narrative using a historian agent, identifies missing comparisons via a baseline scout, and verifies claims through a multi-aspect Q&A engine, grounding the critique in live web-scale literature. We evaluate ScholarPeer on DeepReview-13K. The results demonstrate that ScholarPeer achieves significant win rates against state-of-the-art approaches in side-by-side evaluations and reduces the gap to human-level diversity.
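The dual-stream process described in the abstract can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the authors' implementation: every name here (`SearchFn`, `historian`, `baseline_scout`, `qa_verify`, `review_paper`, `toy_search`) is hypothetical, the search backend is a stub dictionary rather than live web-scale retrieval, and the rule "a claim counts as supported if the search returns any evidence" is a placeholder for whatever verification the real Q&A engine performs.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple

# Hypothetical search interface: maps a query string to a list of text snippets.
SearchFn = Callable[[str], List[str]]


@dataclass
class Review:
    narrative: List[str] = field(default_factory=list)          # context stream: historian
    missing_baselines: List[str] = field(default_factory=list)  # context stream: baseline scout
    verified: List[str] = field(default_factory=list)           # verification stream
    unverified: List[str] = field(default_factory=list)         # verification stream


def historian(topic: str, search: SearchFn) -> List[str]:
    """Collect snippets describing how the field developed (assumed query form)."""
    return search(f"history of {topic}")


def baseline_scout(topic: str, compared: Sequence[str], search: SearchFn) -> List[str]:
    """Return candidate baselines found by search that the paper did not compare against."""
    seen = {name.lower() for name in compared}
    return [b for b in search(f"baselines for {topic}") if b.lower() not in seen]


def qa_verify(claims: Sequence[str], search: SearchFn) -> Tuple[List[str], List[str]]:
    """Placeholder check: a claim is 'verified' if the search returns any evidence."""
    verified, unverified = [], []
    for claim in claims:
        (verified if search(claim) else unverified).append(claim)
    return verified, unverified


def review_paper(topic: str, compared: Sequence[str],
                 claims: Sequence[str], search: SearchFn) -> Review:
    """Run context acquisition first, then active verification."""
    narrative = historian(topic, search)
    missing = baseline_scout(topic, compared, search)
    verified, unverified = qa_verify(claims, search)
    return Review(narrative, missing, verified, unverified)


def toy_search(query: str) -> List[str]:
    """Stub corpus standing in for literature retrieval (made-up entries)."""
    corpus = {
        "history of automated peer review": ["text classification",
                                             "structured feedback generation"],
        "baselines for automated peer review": ["SystemA", "SystemB"],
        "claim one": ["supporting snippet"],
    }
    return corpus.get(query, [])


review = review_paper("automated peer review", ["SystemA"],
                      ["claim one", "claim two"], toy_search)
```

In this toy run, `review.missing_baselines` holds the baselines the stub corpus knows about that the paper did not compare against, and `review.unverified` holds the claims for which no evidence was retrieved. Keeping the search function as a parameter makes it possible to replace the stub with a real retrieval backend without changing the agents.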