AI Summary
This work addresses a critical limitation of current large language model (LLM) detection methods: they reduce peer review authorship attribution to a binary classification between human and AI-generated text, neglecting hybrid collaboration in which intellectual origin and textual expression may diverge. To bridge this gap, the authors introduce PeerPrism, a benchmark of 20,690 peer reviews that are fully human-written, fully synthetic, or produced under various hybrid generation regimes. PeerPrism reframes authorship attribution as a multidimensional construct integrating semantic reasoning and stylistic expression, and it is the first evaluation dataset designed specifically for human-AI collaborative reviewing. Using controlled generation, state-of-the-art detectors, stylometric analysis, and semantic probing, the study shows that existing methods diverge substantially in mixed-authorship scenarios and frequently conflate AI-generated phrasing with AI-originated ideas, a fundamental flaw that exposes their inability to distinguish surface-level expression from substantive intellectual contribution.
Abstract
Large Language Models (LLMs) are increasingly used in scientific peer review, assisting with drafting, rewriting, expansion, and refinement. However, existing peer-review LLM detection methods largely treat authorship as a binary problem (human vs. AI) without accounting for the hybrid nature of modern review workflows. In practice, evaluative ideas and surface realization may originate from different sources, creating a spectrum of human-AI collaboration.
In this work, we introduce PeerPrism, a large-scale benchmark of 20,690 peer reviews explicitly designed to disentangle idea provenance from text provenance. We construct controlled generation regimes spanning fully human, fully synthetic, and multiple hybrid transformations. This design enables systematic evaluation of whether detectors identify the origin of the surface text or the origin of the evaluative reasoning. We benchmark state-of-the-art LLM text detection methods on PeerPrism. While several methods achieve high accuracy on the standard binary task (human vs. fully synthetic), their predictions diverge sharply under hybrid regimes. In particular, when ideas originate from humans but the surface text is AI-generated, detectors frequently disagree and produce contradictory classifications. Accompanied by stylometric and semantic analyses, our results show that current detection methods conflate surface realization with intellectual contribution.
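The two-axis framing above (idea provenance vs. text provenance) and the notion of detectors disagreeing within a regime can be illustrated with a minimal sketch. Everything here is an assumption for illustration only: the `Review` record, the detector names `det_a` and `det_b`, the pairwise disagreement measure, and the toy predictions are hypothetical and are not taken from the PeerPrism code, data, or results.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Review:
    # Hypothetical record: each review carries two provenance labels
    # instead of a single human/AI label.
    idea_source: str   # "human" or "ai": who produced the evaluative ideas
    text_source: str   # "human" or "ai": who produced the surface text
    predictions: dict  # detector name -> predicted label ("human" or "ai")


def pairwise_disagreement(reviews):
    """Per regime, return the fraction of detector pairs whose labels differ."""
    counts = defaultdict(lambda: [0, 0])  # regime -> [disagreeing, total]
    for review in reviews:
        regime = (review.idea_source, review.text_source)
        for a, b in combinations(sorted(review.predictions), 2):
            counts[regime][1] += 1
            if review.predictions[a] != review.predictions[b]:
                counts[regime][0] += 1
    return {regime: d / n for regime, (d, n) in counts.items() if n}


# Toy, made-up predictions (not benchmark results).
reviews = [
    Review("human", "human", {"det_a": "human", "det_b": "human"}),
    Review("ai", "ai", {"det_a": "ai", "det_b": "ai"}),
    Review("human", "ai", {"det_a": "ai", "det_b": "human"}),
    Review("human", "ai", {"det_a": "ai", "det_b": "ai"}),
]
rates = pairwise_disagreement(reviews)
for regime, rate in sorted(rates.items()):
    print(regime, rate)
```

In this toy setup the binary regimes show no disagreement while the human-idea, AI-text regime does, which mirrors the kind of divergence the abstract describes without reproducing any reported numbers.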
Overall, we demonstrate that LLM detection in peer review cannot be reduced to a binary attribution problem. Instead, authorship must be modeled as a multidimensional construct spanning semantic reasoning and stylistic realization. PeerPrism is the first benchmark evaluating human-AI collaboration in these settings. We release all code, data, prompts, and evaluation scripts to facilitate reproducible research at https://github.com/Reviewerly-Inc/PeerPrism.