On the limits and opportunities of AI reviewers: Reviewing the reviews of Nature-family papers with 45 expert scientists

📅 2026-05-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study systematically evaluates the real-world capabilities and limitations of AI peer reviewers in scientific manuscript assessment, moving beyond prior approaches that only compared AI verdicts with human verdicts. The authors engaged 45 domain experts to annotate individual criticisms from human-written and AI-generated reviews of 82 Nature-family papers, rating each on correctness, significance, and sufficiency of evidence. AI reviews were generated with state-of-the-art large language models: GPT-5.2, Gemini 3.0 Pro, and Claude Opus 4.5. On a composite of the three dimensions, the GPT-5.2 reviewer scored above each paper's top-rated human reviewer (60.0% vs. 48.2%, p = 0.009), and AI reviewers surfaced a distinct 26% of issues that no human reviewer raised. However, AI reviewers overlapped with one another far more than humans did (21% vs. 3%) and exhibited 16 recurring weaknesses that humans do not share, such as limited subfield knowledge, which positions them as complements to human reviewers rather than substitutes.
📝 Abstract
With the advancement of AI capabilities, AI reviewers are beginning to be deployed in scientific peer review, yet their capability and credibility remain in question: many scientists simply view them as probabilistic systems without the expertise to evaluate research, while other researchers are more optimistic about their readiness without concrete evidence. Understanding what AI reviewers do well, where they fall short, and what challenges remain is essential. However, existing evaluations of AI reviewers have focused on whether their verdicts match human verdicts (e.g., score alignment, acceptance prediction), which is insufficient to characterize their capabilities and limits. In this paper, we close this gap through a large-scale expert annotation study, in which 45 domain scientists in Physical, Biological, and Health Sciences spent 469 hours rating 2,960 individual criticisms (each targeting one specific aspect of a paper) from human-written and AI-generated reviews of 82 Nature-family papers on correctness, significance, and sufficiency of evidence. On a composite of all three dimensions, a reviewing agent powered by GPT-5.2 scores above each paper's top-rated human reviewer (60.0% vs. 48.2%, p = 0.009), while all three AI reviewers (including Gemini 3.0 Pro and Claude Opus 4.5) exceed the lowest-rated human across every dimension. AI reviewers' accurate criticisms are also more often rated significant and well-evidenced, and surface a distinct 26% of issues no human raises. However, AI reviewers overlap far more than humans do (21% vs. 3% for cross-reviewer pairs), and exhibit 16 recurring weaknesses humans do not share, such as limited subfield knowledge, lack of long context management over multiple files, and overly critical stance on minor issues. Overall, our results position current AI reviewers as complements to, not substitutes for, human reviewers.
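The abstract reports a composite score over three rating dimensions and a cross-reviewer overlap rate, but does not spell out how either is computed. The following is a minimal sketch under stated assumptions, not the authors' method: it assumes a criticism counts toward the composite only when it is rated correct, significant, and well-evidenced, and it assumes overlap is the mean Jaccard similarity of the issue sets raised by each pair of reviewers. The `Criticism` fields, the function names, and the toy data are all hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Criticism:
    reviewer: str         # hypothetical reviewer label
    issue_id: str         # hypothetical ID of the underlying issue
    correct: bool         # expert rating: is the criticism accurate?
    significant: bool     # expert rating: does the issue matter?
    well_evidenced: bool  # expert rating: is the evidence sufficient?


def composite_rate(criticisms):
    """Share of criticisms that pass all three ratings (assumed definition)."""
    if not criticisms:
        return 0.0
    passed = sum(
        c.correct and c.significant and c.well_evidenced for c in criticisms
    )
    return passed / len(criticisms)


def pairwise_overlap(criticisms):
    """Mean Jaccard similarity of issue sets over reviewer pairs (assumed metric)."""
    issues = {}
    for c in criticisms:
        issues.setdefault(c.reviewer, set()).add(c.issue_id)
    pairs = list(combinations(sorted(issues), 2))
    if not pairs:
        return 0.0
    total = 0.0
    for a, b in pairs:
        union = issues[a] | issues[b]
        total += len(issues[a] & issues[b]) / len(union) if union else 0.0
    return total / len(pairs)


# Toy data, invented purely for illustration.
toy = [
    Criticism("ai_1", "stats", True, True, True),
    Criticism("ai_1", "controls", True, False, True),
    Criticism("ai_2", "stats", True, True, False),
    Criticism("ai_2", "novelty", False, False, False),
]

print(composite_rate(toy))
print(pairwise_overlap(toy))
```

Under these assumptions, a reviewer that raises many accurate but minor points scores low on the composite, and two reviewers that repeat each other's issues score high on overlap, which is the kind of contrast the abstract draws between AI and human reviewers.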
Problem

Research questions and friction points this paper is trying to address.

AI reviewers
peer review
scientific evaluation
review quality
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

AI peer review
expert annotation
review quality evaluation
large language models
scientific reproducibility
Seungone Kim
Carnegie Mellon University
Large Language Models, Natural Language Processing
Dongkeun Yoon
KAIST
Kiril Gashteovski
Senior Research Scientist at NEC Laboratories Europe, Germany
Artificial Intelligence, Natural Language Processing, Evaluation, Explainable AI
Juyoung Suk
KAIST
Large Language Models
Jinheon Baek
Ph.D. student, KAIST
Machine Learning, Natural Language Processing, RAG
Pranjal Aggarwal
Carnegie Mellon University
Ian Wu
Carnegie Mellon University
Machine Learning
Viktor Zaverkin
INM - Leibniz Institute for New Materials; Saarland University; German Research Center for Artificial Intelligence (DFKI)
Spase Petkoski
Ss. Cyril and Methodius University in Skopje; Aix Marseille University, INSERM
Daniel R. Schrider
University of North Carolina at Chapel Hill
Ilija Dukovski
Ss. Cyril and Methodius University in Skopje; Boston University
Francesco Santini
Dipartimento Matematica e Informatica, Perugia
Constraint Programming, Argumentation Frameworks, Orchestration/Choreography in Service Oriented Architectures, Trust Management
Biljana Mitreska
University of Manchester
Yong Jeong
Professor, Bio and Brain Engineering, KAIST
Neurodegenerative diseases, Cognition, Neuroimaging, Neurovascular unit
Kyeongha Kwon
KAIST
Young Min Sim
KAIST
Dragana Manasova
Massachusetts Institute of Technology
Arthur Porto
Florida Museum of Natural History, University of Florida
Biljana Mojsoska
Associate Professor, Roskilde University
Microbiology, Analytical chemistry, Peptide chemistry, Proteomics
Makoto Takamoto
NEC Laboratories Europe
Marko Shuntov
University of Copenhagen
Ruoqi Liu
Stanford University
Hyunjoo Jenny Lee
KAIST
Bio/Medical MEMS, Neural Interface, Ultrasound Neuromodulation
Niyazi Ulas Dinç
École Polytechnique Fédérale de Lausanne
Yehhyun Jo
Institute for Basic Science (IBS)