The Good, the Bad and the Constructive: Automatically Measuring Peer Review's Utility for Authors

📅 2025-08-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenge of ensuring review quality under constrained peer-review resources by automating the assessment of a review's utility for authors. It first proposes a systematic, four-dimensional definition of utility, covering Actionability, Grounding & Specificity, Verifiability, and Helpfulness, and constructs RevUtil, a large-scale benchmark of human-annotated and synthetically labeled review comments, the latter accompanied by rationales. An evaluation framework supports multi-dimensional scoring and rationale generation. Leveraging these fine-grained annotations, the authors fine-tune open-source language models whose agreement with human annotators is comparable to, and in some cases exceeds, that of GPT-4o across all dimensions. Empirical analysis further shows that current machine-generated reviews generally underperform human-written ones on these aspects. This work establishes the first reproducible benchmark and methodological foundation for automated utility evaluation of peer reviews.

📝 Abstract
Providing constructive feedback to paper authors is a core component of peer review. With reviewers increasingly having less time to perform reviews, automated support systems are required to ensure high reviewing quality, thus making the feedback in reviews useful for authors. To this end, we identify four key aspects of review comments (individual points in weakness sections of reviews) that drive the utility for authors: Actionability, Grounding & Specificity, Verifiability, and Helpfulness. To enable evaluation and development of models assessing review comments, we introduce the RevUtil dataset. We collect 1,430 human-labeled review comments and scale our data with 10k synthetically labeled comments for training purposes. The synthetic data additionally contains rationales, i.e., explanations for the aspect score of a review comment. Employing the RevUtil dataset, we benchmark fine-tuned models for assessing review comments on these aspects and generating rationales. Our experiments demonstrate that these fine-tuned models achieve agreement levels with humans comparable to, and in some cases exceeding, those of powerful closed models like GPT-4o. Our analysis further reveals that machine-generated reviews generally underperform human reviews on our four aspects.
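As a rough illustration of the kind of aspect-based scoring the paper benchmarks, the sketch below prompts a chat LLM to rate a single review comment on the four utility aspects and return per-aspect rationales. The prompt wording, score scale, model name, and helper function are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch: score one review comment on the four RevUtil aspects
# with a chat LLM. Prompt, 1-5 scale, and model choice are assumptions.
import json
from openai import OpenAI

ASPECTS = ["Actionability", "Grounding & Specificity", "Verifiability", "Helpfulness"]

PROMPT_TEMPLATE = """You are evaluating a peer-review comment for its utility to the paper's authors.
Rate the comment on each aspect from 1 (poor) to 5 (excellent) and give a one-sentence rationale per aspect.
Aspects: {aspects}

Review comment:
\"\"\"{comment}\"\"\"

Respond with a JSON object: {{"scores": {{aspect: int}}, "rationales": {{aspect: str}}}}"""


def score_comment(comment: str, model: str = "gpt-4o") -> dict:
    """Ask the model for per-aspect scores and rationales for one comment."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": PROMPT_TEMPLATE.format(aspects=", ".join(ASPECTS), comment=comment),
        }],
        response_format={"type": "json_object"},  # ask for parseable JSON
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)


if __name__ == "__main__":
    example = ("The evaluation only covers English datasets; adding at least one "
               "multilingual benchmark would strengthen the generalization claim.")
    print(score_comment(example))
```

In the paper itself, fine-tuned open-source models play the role of the scorer; the closed model is used here only to keep the example self-contained.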
Problem

Research questions and friction points this paper is trying to address.

Automatically measure the utility of peer-review feedback for authors
Identify the key aspects that make review comments useful for authors
Benchmark models that assess review comments and generate rationales
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuned open-source models assess review comments on four utility aspects
Large-scale synthetic training data with aspect-score rationales
Fine-tuned models match or exceed GPT-4o's agreement with human annotators
Abdelrahman Sadallah
NLP Department, Mohamed Bin Zayed University of Artificial Intelligence
Tim Baumgärtner
Ubiquitous Knowledge Processing Lab (UKP Lab), Department of Computer Science and Hessian Center for AI (hessian.AI), Technical University of Darmstadt
Iryna Gurevych
Full Professor, TU Darmstadt; Adjunct Professor, MBZUAI, UAE; Affiliated Professor, INSAIT, Bulgaria
Natural Language Processing · Large Language Models · Artificial Intelligence
Ted Briscoe
Professor of Natural Language Processing, MBZUAI
Computational Linguistics · Natural Language Processing · Evolutionary Linguistics