🤖 AI Summary
This study addresses the limitation of current AI peer-review evaluation, which overly emphasizes score prediction while neglecting the core intellectual value embedded in review texts—such as argumentation, questioning, and critique. The work proposes the first text-centric, multidimensional evaluation framework that assesses AI-generated reviews along five dimensions: content fidelity, argument consistency, focus stability, constructiveness of questions, and detectable AI artifacts. To handle expert disagreement, it introduces a Max-Recall strategy. Leveraging a high-confidence human review dataset and combining argument recall, n-gram comparison, and data-cleaning techniques, experiments reveal that conventional n-gram metrics poorly align with human preferences, whereas the proposed text-oriented metrics—particularly recall of critical weakness arguments—show strong correlation with scoring accuracy, underscoring the importance of aligning AI critiques with human expert focal points.
📝 Abstract
The rapid adoption of Large Language Models (LLMs) has spurred interest in automated peer review; however, progress is currently stifled by benchmarks that treat reviewing primarily as a rating prediction task. We argue that the utility of a review lies in its textual justification--its arguments, questions, and critique--rather than a scalar score. To address this, we introduce Beyond Rating, a holistic evaluation framework that assesses AI reviewers across five dimensions: Content Faithfulness, Argumentative Alignment, Focus Consistency, Question Constructiveness, and AI-Likelihood. Notably, we propose a Max-Recall strategy to accommodate valid expert disagreement and introduce a curated dataset of papers with high-confidence reviews, rigorously filtered to remove procedural noise. Extensive experiments demonstrate that while traditional n-gram metrics fail to reflect human preferences, our proposed text-centric metrics--particularly the recall of weakness arguments--correlate strongly with rating accuracy. These findings establish that aligning AI critique focus with human experts is a prerequisite for reliable automated scoring, offering a robust standard for future research.
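The abstract names the Max-Recall strategy without defining it. One plausible reading, offered here only as an assumption, is that the AI review's arguments are scored for recall against each human reviewer separately and the best score is kept, so the AI review is not penalized for siding with one expert where experts disagree. The sketch below illustrates that reading. The function names, the token-overlap matcher, and the 0.5 threshold are all hypothetical stand-ins and are not taken from the paper, which may match arguments quite differently (for example with an LLM judge or embeddings).

```python
from typing import Callable, List

Matcher = Callable[[str, str], bool]


def token_overlap_match(ai_arg: str, human_arg: str, threshold: float = 0.5) -> bool:
    """Hypothetical matcher: Jaccard overlap of lowercase tokens (an assumption)."""
    a, b = set(ai_arg.lower().split()), set(human_arg.lower().split())
    if not a or not b:
        return False
    return len(a & b) / len(a | b) >= threshold


def recall(ai_args: List[str], human_args: List[str],
           match: Matcher = token_overlap_match) -> float:
    """Fraction of one reviewer's arguments covered by at least one AI argument."""
    if not human_args:
        return 0.0
    covered = sum(any(match(ai, h) for ai in ai_args) for h in human_args)
    return covered / len(human_args)


def max_recall(ai_args: List[str], human_reviews: List[List[str]],
               match: Matcher = token_overlap_match) -> float:
    """Assumed Max-Recall: the best per-reviewer recall across all human reviews."""
    return max((recall(ai_args, r, match) for r in human_reviews), default=0.0)


# Toy weakness arguments (made up for illustration, not from the dataset).
ai_weaknesses = [
    "the baseline comparison is missing",
    "ablation study is incomplete",
]
human_weaknesses = [
    ["the baseline comparison is missing", "writing is unclear in section 3"],
    ["ablation study is incomplete"],
]

score = max_recall(ai_weaknesses, human_weaknesses)
```

Under this reading, the AI review covers half of the first reviewer's weaknesses and all of the second reviewer's, so the maximum rewards agreement with the second reviewer. Averaging across reviewers would instead penalize the AI review for every point on which the two experts differ.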