🤖 AI Summary
Distinguishing LLM-generated from human-written peer reviews remains challenging, particularly when LLMs are used for editing or ghostwriting. Method: We propose the first hypothesis-free watermark-based detection framework designed specifically for peer-review scenarios. Our approach uses indirect prompt injection, such as font-based steganography and jailbreaking prompts, embedded in PDF manuscripts to implicitly trigger LLMs to generate reviews containing detectable watermarks. We design a lightweight watermark encoding scheme and integrate a statistically rigorous family-wise error rate control procedure (more powerful than Bonferroni correction) to ensure bounded false-positive rates and high detection power across multiple concurrent reviews. Contribution/Results: The framework requires no prior assumptions about human-written text, is robust to common adversarial defenses (e.g., paraphrasing, reformatting), achieves high watermark embedding success across diverse LLMs, and demonstrates empirically low false-positive rates with significantly better detection performance than baseline methods.
📝 Abstract
Editors of academic journals and program chairs of conferences require peer reviewers to write their own reviews. However, there is growing concern about the rise of lazy reviewing practices, where reviewers use large language models (LLMs) to generate reviews instead of writing them independently. Existing tools for detecting LLM-generated content are not designed to differentiate between fully LLM-generated reviews and those merely polished by an LLM. In this work, we employ a straightforward approach to identifying LLM-generated reviews: an indirect prompt injection via the paper's PDF that asks the LLM to embed a watermark in its output. Our focus is on watermarking schemes and statistical tests that maintain a bounded family-wise error rate when a venue evaluates multiple reviews, with higher power than standard methods such as Bonferroni correction. These guarantees hold without any assumptions about human-written reviews. We also consider several methods for performing the prompt injection, including font embedding and jailbreaking. We evaluate the effectiveness and tradeoffs of these methods, including against various reviewer defenses. We find a high success rate in embedding our watermarks in LLM-generated reviews across models. Our approach is resilient to common reviewer defenses, and the error-rate bounds of our statistical tests hold in practice while retaining the power to flag LLM-generated reviews, whereas Bonferroni correction is infeasible.
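The abstract does not spell out the watermark statistic or the exact multiple-testing procedure, so the following is only an illustrative sketch under assumed details: suppose the injected prompt asks the LLM to begin sentences with a rare marker word, and that in human-written reviews each sentence starts with that word with probability at most `p0`. A per-review p-value is then a binomial tail probability, and Holm's step-down procedure (a standard method that controls the family-wise error rate and is uniformly more powerful than plain Bonferroni) flags reviews across the venue. The marker counts and thresholds below are hypothetical.

```python
import math

def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): the chance that a human-written
    review with n sentences starts k or more of them with the marker word
    purely by coincidence."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def holm_flag(pvals, alpha=0.05):
    """Holm step-down procedure: test p-values from smallest to largest
    against alpha/m, alpha/(m-1), ...; stop at the first failure.
    Controls the family-wise error rate at alpha with no independence
    assumptions, and rejects at least as often as Bonferroni (alpha/m)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    flagged = set()
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (m - rank):
            flagged.add(i)
        else:
            break
    return flagged

# Hypothetical numbers: four submitted reviews, 40 sentences each, and a
# marker word that opens a human-written sentence with probability <= 0.01.
counts = [12, 0, 1, 9]                           # marker sentences observed
pvals = [binom_tail(k, 40, 0.01) for k in counts]
print(holm_flag(pvals))                          # reviews with improbable counts
```

With these toy numbers, only the reviews with implausibly many marker sentences are flagged; a review with zero or one marker sentence yields a large p-value and is never accused, which mirrors the paper's goal of a bounded false-positive rate without modeling human-written text.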