Gen-Review: A Large-scale Dataset of AI-Generated (and Human-written) Peer Reviews

📅 2025-10-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) are increasingly deployed in scientific peer review, yet their impact lacks systematic empirical validation. To address this gap, we introduce GenReview—the largest publicly available dataset of AI-generated reviews to date (81K instances)—constructed from real ICLR submissions (2018–2025) and their original human reviews, with diverse critiques synthesized via multi-sentiment prompting. Our study is the first to systematically uncover three critical properties of LLM-generated reviews: pronounced preference biases, instruction-following deviations, and detectability using existing forensic methods; we further confirm that LLM-assigned scores align with actual acceptance decisions only for accepted papers. GenReview provides a foundational, open-source resource enabling rigorous investigation into review automation, inter-reviewer consistency, bias quantification, and research integrity.

📝 Abstract
How does the progressive embrace of Large Language Models (LLMs) affect scientific peer reviewing? This multifaceted question is fundamental to the effectiveness -- as well as to the integrity -- of the scientific process. Recent evidence suggests that LLMs may have already been tacitly used in peer reviewing, e.g., at the 2024 International Conference on Learning Representations (ICLR). Furthermore, some efforts have been undertaken to explicitly integrate LLMs into peer reviewing by various editorial boards (including that of ICLR'25). To fully understand the utility and the implications of deploying LLMs for scientific reviewing, a comprehensive relevant dataset is strongly desirable. Despite some previous research on this topic, such a dataset has been lacking so far. We fill this gap by presenting GenReview, the largest dataset of LLM-written reviews to date. Our dataset includes 81K reviews generated for all submissions to the 2018--2025 editions of ICLR by providing the LLM with three independent prompts: a negative, a positive, and a neutral one. GenReview is also linked to the respective papers and their original reviews, thereby enabling a broad range of investigations. To illustrate the value of GenReview, we explore a sample of intriguing research questions, namely: whether LLMs exhibit bias in reviewing (they do); whether LLM-written reviews can be automatically detected (so far, they can); whether LLMs can rigorously follow reviewing instructions (not always); and whether LLM-provided ratings align with decisions on paper acceptance or rejection (they do only for accepted papers). GenReview can be accessed at the following link: https://anonymous.4open.science/r/gen_review.
Problem

Research questions and friction points this paper is trying to address.

Investigating LLM usage impact on scientific peer review integrity
Creating largest dataset of AI-generated vs human-written reviews
Analyzing LLM bias, detection, and rating alignment in reviews
Innovation

Methods, ideas, or system contributions that make the work stand out.

Created large-scale dataset of AI-generated peer reviews
Used multiple prompt types to generate diverse reviews
Linked reviews to original papers for comprehensive analysis
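The multi-sentiment prompting scheme above (one negative, one neutral, and one positive prompt per submission) can be sketched roughly as follows. The prompt wording, the `query_llm` placeholder, and the `generate_reviews` helper are illustrative assumptions for this sketch, not the authors' actual prompts or pipeline.

```python
# Hypothetical sketch of GenReview-style multi-sentiment prompting:
# each paper is reviewed three times, once per sentiment framing.
SENTIMENT_PROMPTS = {
    "negative": "Write a critical peer review of the paper below, "
                "focusing on weaknesses.\n\n{paper}",
    "neutral":  "Write a balanced peer review of the paper below, "
                "covering both strengths and weaknesses.\n\n{paper}",
    "positive": "Write a favorable peer review of the paper below, "
                "focusing on strengths.\n\n{paper}",
}

def query_llm(prompt: str) -> str:
    """Placeholder for a real LLM API call (assumed, not part of the paper)."""
    return f"[review generated from a prompt of {len(prompt)} characters]"

def generate_reviews(paper_text: str) -> dict[str, str]:
    """Return one generated review per sentiment framing."""
    return {
        sentiment: query_llm(template.format(paper=paper_text))
        for sentiment, template in SENTIMENT_PROMPTS.items()
    }

reviews = generate_reviews("Title: Example ICLR submission\nAbstract: ...")
print(sorted(reviews))  # ['negative', 'neutral', 'positive']
```

Keeping the three prompts independent (rather than asking one model call for all three reviews) matches the dataset description and avoids the sentiments leaking into one another within a single generation.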