Automatically Evaluating the Paper Reviewing Capability of Large Language Models

📅 2025-02-24
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
To address the dual challenges of expert scarcity and escalating review workloads in scientific peer review, this paper introduces the first scalable, automated evaluation framework for assessing LLMs' paper review capabilities. Built on 676 OpenReview papers, the framework implements a multi-dimensional evaluation pipeline that integrates semantic similarity matching, structured review element extraction, multi-faceted consistency scoring, and alignment with expert-derived ground truth. It systematically quantifies LLM performance across key review tasks, including strength/weakness identification, novelty assessment, and acceptance recommendation. Experimental results reveal substantial systemic deficiencies: a 72% novelty misidentification rate, only 51.3% accuracy in acceptance decisions (significantly below expert performance), and pervasive perspective imbalance and decision bias. The framework enables longitudinal, cross-model evaluation, establishing a standardized benchmark and actionable insights for advancing LLMs' scientific-assistance capabilities.
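The matching step in the pipeline above, pairing LLM-generated review points with expert-written ones by semantic similarity, can be sketched roughly as follows. This is a minimal illustration only: the `match_points` helper, the bag-of-words cosine similarity, and the 0.5 threshold are assumptions for demonstration, not the paper's actual implementation (which presumably uses stronger sentence-embedding models).

```python
import math
import re
from collections import Counter

def cosine_sim(a: str, b: str) -> float:
    """Bag-of-words cosine similarity; a stand-in for an embedding model."""
    va = Counter(re.findall(r"\w+", a.lower()))
    vb = Counter(re.findall(r"\w+", b.lower()))
    num = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    den = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(sum(v * v for v in vb.values()))
    return num / den if den else 0.0

def match_points(llm_points, expert_points, threshold=0.5):
    """Greedily pair each LLM review point with its most similar expert point.

    Illustrative sketch only; the paper's pipeline likely uses learned
    sentence embeddings rather than word-overlap similarity.
    """
    matches = []
    for lp in llm_points:
        scored = [(cosine_sim(lp, ep), ep) for ep in expert_points]
        best_score, best_ep = max(scored, default=(0.0, None))
        if best_ep is not None and best_score >= threshold:
            matches.append((lp, best_ep))
    return matches

# Toy example: agreement = fraction of LLM points matched to an expert point.
llm = ["The novelty of the method is limited.", "Experiments lack baselines."]
expert = ["The proposed method has limited novelty.", "The writing is unclear."]
agreement = len(match_points(llm, expert)) / len(llm)
```

A matched pair counts toward LLM-expert agreement; unmatched LLM points would be candidates for the hallucinated or off-target criticisms the evaluation quantifies.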

📝 Abstract
Peer review is essential for scientific progress, but it faces challenges such as reviewer shortages and growing workloads. Although Large Language Models (LLMs) show potential for providing assistance, research has reported significant limitations in the reviews they generate. While the insights are valuable, conducting the analysis is challenging due to the considerable time and effort required, especially given the rapid pace of LLM developments. To address the challenge, we developed an automatic evaluation pipeline to assess the LLMs' paper review capability by comparing them with expert-generated reviews. By constructing a dataset consisting of 676 OpenReview papers, we examined the agreement between LLMs and experts in their strength and weakness identifications. The results showed that LLMs lack balanced perspectives, significantly overlook novelty assessment when criticizing, and produce poor acceptance decisions. Our automated pipeline enables a scalable evaluation of LLMs' paper review capability over time.
Problem

Research questions and friction points this paper is trying to address.

Evaluate LLMs' paper reviewing capability.
Compare LLM reviews with expert reviews.
Assess LLMs' performance in identifying strengths and weaknesses.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automatic evaluation pipeline
Comparison with expert reviews
Scalable LLM capability assessment