🤖 AI Summary
To address the dual challenges of expert scarcity and escalating review workloads in scientific peer review, this paper introduces the first scalable, automated evaluation framework for assessing LLMs' paper review capabilities. Built on 676 OpenReview papers, the framework implements a multi-dimensional evaluation pipeline that integrates semantic similarity matching, structured review element extraction, multi-faceted consistency scoring, and alignment with expert-derived ground truth. It systematically quantifies LLM performance across key review tasks, including strength/weakness identification, novelty assessment, and acceptance recommendation. Experimental results reveal substantial systemic deficiencies: a 72% novelty misidentification rate, only 51.3% accuracy on acceptance decisions (significantly below expert performance), and pervasive perspective imbalance and decision bias. The framework enables longitudinal, cross-model evaluation, establishing a standardized benchmark and yielding actionable insights for advancing LLMs' scientific-assistance capabilities.
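The core of such a pipeline is matching each review point an LLM produces against the expert-written review and scoring the agreement. A minimal sketch of that matching step, assuming a simple bag-of-words cosine similarity and a greedy one-to-one pass (the paper's actual pipeline presumably uses stronger semantic embeddings; the function names, threshold, and example sentences here are illustrative):

```python
from collections import Counter
import math

def cosine_sim(a: str, b: str) -> float:
    """Bag-of-words cosine similarity between two review points."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def match_points(llm_points, expert_points, threshold=0.3):
    """Greedily pair each LLM review point with its most similar
    expert point; count a match only above the threshold."""
    matches = []
    for lp in llm_points:
        best = max(expert_points, key=lambda ep: cosine_sim(lp, ep), default=None)
        if best is not None and cosine_sim(lp, best) >= threshold:
            matches.append((lp, best))
    return matches

# Toy example: two LLM weaknesses vs. two expert weaknesses.
llm = ["the method lacks novelty over prior work",
       "experiments are limited to small datasets"]
expert = ["novelty over prior work is unclear",
          "writing could be improved"]
pairs = match_points(llm, expert)
agreement = len(pairs) / len(llm)  # fraction of LLM points echoed by the expert
```

In this toy run only the novelty criticism finds an expert counterpart, so `agreement` is 0.5; aggregating such per-point agreement over many papers is what lets the evaluation scale without manual comparison.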
📝 Abstract
Peer review is essential for scientific progress, but it faces challenges such as reviewer shortages and growing workloads. Although Large Language Models (LLMs) show potential for providing assistance, research has reported significant limitations in the reviews they generate. While such insights are valuable, producing them manually requires considerable time and effort, especially given the rapid pace of LLM development. To address this challenge, we developed an automatic evaluation pipeline that assesses LLMs' paper review capability by comparing their reviews with expert-generated ones. Using a dataset of 676 OpenReview papers, we examined the agreement between LLMs and experts in their identification of strengths and weaknesses. The results show that LLMs lack balanced perspectives, significantly overlook novelty assessment when criticizing, and produce poor acceptance decisions. Our automated pipeline enables scalable evaluation of LLMs' paper review capability over time.