Gen-Review: A Large-scale Dataset of AI-Generated (and Human-written) Peer Reviews

📅 2025-10-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) are increasingly deployed in scientific peer review, yet their impact lacks systematic empirical validation. To address this gap, we introduce GenReview—the largest publicly available dataset of AI-generated reviews to date (81K instances)—constructed from real ICLR submissions (2018–2025) and their original human reviews, with diverse critiques synthesized via multi-sentiment prompting. Our study is the first to systematically uncover three critical properties of LLM-generated reviews: pronounced preference biases, instruction-following deviations, and detectability using existing forensic methods; we further confirm that LLM-assigned scores align with actual acceptance decisions only for accepted papers. GenReview provides a foundational, open-source resource enabling rigorous investigation into review automation, inter-reviewer consistency, bias quantification, and research integrity.

📝 Abstract
How does the progressive embrace of Large Language Models (LLMs) affect scientific peer reviewing? This multifaceted question is fundamental to the effectiveness -- as well as to the integrity -- of the scientific process. Recent evidence suggests that LLMs may have already been tacitly used in peer reviewing, e.g., at the 2024 International Conference on Learning Representations (ICLR). Furthermore, some efforts have been undertaken to explicitly integrate LLMs into peer reviewing by various editorial boards (including that of ICLR'25). To fully understand the utility and the implications of deploying LLMs for scientific reviewing, a comprehensive relevant dataset is strongly desirable. Despite some previous research on this topic, such a dataset has been lacking so far. We fill this gap by presenting GenReview, the largest dataset of LLM-written reviews to date. Our dataset includes 81K reviews generated for all submissions to the 2018--2025 editions of ICLR by providing the LLM with three independent prompts: a negative, a positive, and a neutral one. GenReview is also linked to the respective papers and their original reviews, thereby enabling a broad range of investigations. To illustrate the value of GenReview, we explore a sample of intriguing research questions, namely: whether LLMs exhibit bias in reviewing (they do); whether LLM-written reviews can be automatically detected (so far, they can); whether LLMs can rigorously follow reviewing instructions (not always); and whether LLM-provided ratings align with decisions on paper acceptance or rejection (they do only for accepted papers). GenReview can be accessed at the following link: https://anonymous.4open.science/r/gen_review.
Problem

Research questions and friction points this paper is trying to address.

Investigating LLM usage impact on scientific peer review integrity
Creating largest dataset of AI-generated vs human-written reviews
Analyzing LLM bias, detection, and rating alignment in reviews
Innovation

Methods, ideas, or system contributions that make the work stand out.

Created large-scale dataset of AI-generated peer reviews
Used multiple prompt types to generate diverse reviews
Linked reviews to original papers for comprehensive analysis
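The multi-sentiment prompting scheme above (one negative, one neutral, and one positive prompt per submission) can be sketched roughly as follows. The prompt wording, the `query_llm` placeholder, and the `generate_reviews` helper are illustrative assumptions for this sketch, not the authors' actual prompts or pipeline.

```python
# Hypothetical sketch of GenReview-style multi-sentiment prompting:
# each paper is reviewed three times, once per sentiment framing.
SENTIMENT_PROMPTS = {
    "negative": "Write a critical peer review of the paper below, "
                "focusing on weaknesses.\n\n{paper}",
    "neutral":  "Write a balanced peer review of the paper below, "
                "covering both strengths and weaknesses.\n\n{paper}",
    "positive": "Write a favorable peer review of the paper below, "
                "focusing on strengths.\n\n{paper}",
}

def query_llm(prompt: str) -> str:
    """Placeholder for a real LLM API call (assumed, not part of the paper)."""
    return f"[review generated from a prompt of {len(prompt)} characters]"

def generate_reviews(paper_text: str) -> dict[str, str]:
    """Return one generated review per sentiment framing."""
    return {
        sentiment: query_llm(template.format(paper=paper_text))
        for sentiment, template in SENTIMENT_PROMPTS.items()
    }

reviews = generate_reviews("Title: Example ICLR submission\nAbstract: ...")
print(sorted(reviews))  # ['negative', 'neutral', 'positive']
```

Keeping the three prompts independent (rather than asking one model call for all three reviews) matches the dataset description and avoids the sentiments leaking into one another within a single generation.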