Beyond Rating: A Comprehensive Evaluation and Benchmark for AI Reviews

📅 2026-04-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses a limitation of current AI peer-review evaluation: it overemphasizes score prediction and neglects the intellectual value carried by the review text itself, such as argumentation, questioning, and critique. The work proposes a text-centric, multidimensional evaluation framework that assesses AI-generated reviews along five dimensions: Content Faithfulness, Argumentative Alignment, Focus Consistency, Question Constructiveness, and AI-Likelihood. To accommodate valid expert disagreement, it introduces a Max-Recall strategy. Using a curated dataset of papers with high-confidence human reviews, filtered to remove procedural noise, the experiments show that conventional n-gram metrics fail to reflect human preferences, whereas the proposed text-centric metrics, particularly the recall of weakness arguments, correlate strongly with rating accuracy. This underscores the importance of aligning the focus of AI critiques with that of human experts.

📝 Abstract

The rapid adoption of Large Language Models (LLMs) has spurred interest in automated peer review; however, progress is currently stifled by benchmarks that treat reviewing primarily as a rating prediction task. We argue that the utility of a review lies in its textual justification--its arguments, questions, and critique--rather than a scalar score. To address this, we introduce Beyond Rating, a holistic evaluation framework that assesses AI reviewers across five dimensions: Content Faithfulness, Argumentative Alignment, Focus Consistency, Question Constructiveness, and AI-Likelihood. Notably, we propose a Max-Recall strategy to accommodate valid expert disagreement and introduce a curated dataset of papers with high-confidence reviews, rigorously filtered to remove procedural noise. Extensive experiments demonstrate that while traditional n-gram metrics fail to reflect human preferences, our proposed text-centric metrics--particularly the recall of weakness arguments--correlate strongly with rating accuracy. These findings establish that aligning AI critique focus with human experts is a prerequisite for reliable automated scoring, offering a robust standard for future research.
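The abstract does not spell out how Max-Recall is computed. One plausible reading, offered here only as a sketch, is that the AI review is scored against each human reviewer separately and the best recall is kept, so the AI is not penalized for siding with one expert when experts disagree. Everything in the block below is an assumption made for illustration: the function names, the word-overlap matcher `token_overlap_match`, and its 0.5 threshold are hypothetical and are not taken from the paper.

```python
from typing import Callable, List

Matcher = Callable[[str, str], bool]


def token_overlap_match(human_arg: str, ai_arg: str, threshold: float = 0.5) -> bool:
    """Hypothetical matcher: Jaccard overlap of lowercase word sets.

    A real system would more likely use a semantic matcher; this stand-in
    only keeps the example self-contained.
    """
    a = set(human_arg.lower().split())
    b = set(ai_arg.lower().split())
    if not a or not b:
        return False
    return len(a & b) / len(a | b) >= threshold


def recall(human_args: List[str], ai_args: List[str],
           match: Matcher = token_overlap_match) -> float:
    """Fraction of one reviewer's arguments that the AI review also raises."""
    if not human_args:
        return 0.0
    covered = sum(1 for h in human_args if any(match(h, a) for a in ai_args))
    return covered / len(human_args)


def max_recall(args_by_reviewer: List[List[str]], ai_args: List[str],
               match: Matcher = token_overlap_match) -> float:
    """Assumed Max-Recall: best per-reviewer recall across all human reviewers."""
    return max((recall(h, ai_args, match) for h in args_by_reviewer), default=0.0)


# Two reviewers who only partly agree; the AI review matches the second one fully.
reviewers = [
    ["the baseline comparison is missing", "novelty is limited"],
    ["the baseline comparison is missing"],
]
ai_review = ["the baseline comparison is missing"]
score = max_recall(reviewers, ai_review)
```

Under this reading, averaging recall over reviewers would give 0.75 for the example, while taking the maximum credits the AI review for fully matching one valid expert viewpoint.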

Problem

Research questions and friction points this paper is trying to address.

automated peer review

review evaluation

rating prediction

textual justification

AI critique

Innovation

Methods, ideas, or system contributions that make the work stand out.

automated peer review

holistic evaluation

Max-Recall strategy