🤖 AI Summary
Current academic publishing policies permit peer reviewers to use large language models (LLMs) solely for text polishing, yet the enforceability of this restriction remains questionable. This study presents the first systematic evaluation of leading AI text detectors in the context of peer review, constructing a simulated review dataset that reflects varying degrees of human–AI collaboration. To enhance detection performance, the work incorporates review-specific signals such as paper context and scientific writing constraints. Evaluations across five state-of-the-art detectors—including two commercial systems—reveal that existing methods struggle to reliably distinguish between purely human-written and LLM-polished texts, frequently misclassifying polished content as entirely AI-generated. These findings indicate that current detection technologies are insufficient to support policy enforcement, warranting caution in interpreting alleged violations.
📝 Abstract
A number of scientific conferences and journals have recently enacted policies that prohibit LLM usage by peer reviewers, except for polishing, paraphrasing, and grammar correction of otherwise human-written reviews. But are these policies enforceable? To answer this question, we assemble a dataset of peer reviews simulating multiple levels of human-AI collaboration, and evaluate five state-of-the-art detectors, including two commercial systems. Our analysis shows that all detectors misclassify a non-trivial fraction of LLM-polished reviews as AI-generated, thereby risking false accusations of academic misconduct. We further investigate whether peer-review-specific signals, including access to the paper manuscript and the constrained domain of scientific writing, can be leveraged to improve detection. While incorporating such signals yields measurable gains in some settings, we identify limitations in each approach and find that none meets the accuracy standards required for identifying AI use in peer reviews. Importantly, our results suggest that recent public estimates of AI use in peer reviews, derived from AI-text detectors, should be interpreted with caution: current detectors misclassify mixed reviews (collaborative human-AI outputs) as fully AI-generated, potentially overstating the extent of policy violations.