Stop Automating Peer Review Without Rigorous Evaluation

📅 2026-05-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This position paper addresses the lack of systematic evaluation of large language models (LLMs) as peer reviewers, highlighting two risks: homogenized assessments and susceptibility to manipulation. It empirically compares human- and AI-generated reviews of ICLR 2026 submissions and measures how automated rewriting of a paper changes the scores assigned by different AI reviewers. The findings reveal a "hivemind" effect in AI reviewing, with excessive agreement within and across papers, and show that AI review scores can be raised through stylistic rewriting alone, without any change to the scientific results. The authors conclude that today's AI systems should not be used to produce paper reviews, and that addressing the peer review crisis requires a science of peer review automation rather than general-purpose LLMs deployed without rigorous evaluation.

📝 Abstract

Large language models offer a tempting solution to address the peer review crisis. This position paper argues that today's AI systems should not be used to produce paper reviews. We ground this position in an empirical comparison of human- versus AI-generated ICLR 2026 reviews and an evaluation of the effect of automated paper rewriting on different AI reviewers. We identify two critical issues: 1) AI reviewers exhibit a hivemind effect of excessive agreement within and across papers that reduces perspective diversity. 2) AI review scores are trivially gameable through paper laundering: prompting an LLM to rewrite a paper could significantly increase the scores from AI reviewers, demonstrating that LLM reviewers are easy to game through stylistic changes rather than scientific results. However, non-gameability and review diversity are necessary but not sufficient conditions for automation. We argue that addressing the peer review crisis requires a science of peer review automation -- not general-purpose LLMs deployed without rigorous evaluation.
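The two issues named in the abstract can each be read as a simple measurement over review scores. The sketch below is only an illustration of that idea, not the paper's actual metrics or code: the function names, the use of standard deviation as a diversity proxy, and the toy scores are all assumptions made here for clarity.

```python
from statistics import mean, pstdev

# Hypothetical helpers: the abstract does not specify the paper's metrics.

def within_paper_spread(scores_by_paper):
    """Average per-paper standard deviation of reviewer scores.
    Low values mean reviewers of the same paper largely agree."""
    return mean([pstdev(scores) for scores in scores_by_paper.values()])

def across_paper_spread(scores_by_paper):
    """Standard deviation of per-paper mean scores.
    Low values mean different papers receive similar scores."""
    return pstdev([mean(scores) for scores in scores_by_paper.values()])

def mean_score_shift(original, rewritten):
    """Average change in a paper's mean score after an LLM rewrite.
    A positive value means rewriting alone raised the scores."""
    return mean([mean(rewritten[p]) - mean(original[p]) for p in original])

# Toy scores for illustration only; these are NOT results from the paper.
human = {"paper_a": [3, 6, 8], "paper_b": [2, 5, 6]}
ai = {"paper_a": [6, 6, 7], "paper_b": [6, 6, 6]}
ai_after_rewrite = {"paper_a": [7, 7, 8], "paper_b": [7, 7, 7]}

print("within-paper spread (human, AI):",
      within_paper_spread(human), within_paper_spread(ai))
print("across-paper spread (human, AI):",
      across_paper_spread(human), across_paper_spread(ai))
print("mean score shift after rewrite:",
      mean_score_shift(ai, ai_after_rewrite))
```

In this toy setup, a hivemind effect would appear as AI spreads that are much smaller than the human ones, and paper laundering as a positive score shift even though the underlying results of each paper are unchanged.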

Problem

Research questions and friction points this paper is trying to address.

peer review

large language models

review diversity

gaming AI reviewers

paper laundering

Innovation

Methods, ideas, or system contributions that make the work stand out.

peer review automation

large language models

hivemind effect