🤖 AI Summary
This work addresses the vulnerability of large language model (LLM)-based peer review systems to adversarial attacks, noting that existing approaches often conflate prompt-injection susceptibility with genuine reviewing robustness. The authors propose the Paraphrasing Adversarial Attack (PAA), a black-box optimization method that generates semantically equivalent, linguistically fluent paraphrases of academic papers to inflate LLM-assigned review scores without altering the papers' claims. Unlike injection-based attacks, PAA operates solely through semantics-preserving paraphrasing and therefore involves no content manipulation. The study also identifies increased textual perplexity as a potential signal for detecting such attacks. Experiments on five conference datasets demonstrate PAA's effectiveness against three LLM reviewers: attacked submissions receive significantly higher scores, human evaluators confirm high paraphrase quality, and the perplexity of attacked papers rises markedly. Finally, paraphrasing submissions before review partially mitigates the attack.
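The perplexity signal mentioned above can be made concrete with a toy example. The sketch below computes perplexity under a simple unigram language model with Laplace smoothing; it is an illustration of the metric only (a real detector would use an LLM's token log-probabilities, and the function name and smoothing choice here are assumptions, not the paper's method). Text that looks unlike the reference corpus receives higher perplexity, which is the property the detection signal relies on.

```python
import math
from collections import Counter

def unigram_perplexity(text: str, corpus: str) -> float:
    """Perplexity of `text` under a Laplace-smoothed unigram model of `corpus`.

    Toy stand-in for an LLM's token-level perplexity: lower means the text
    is more typical of the reference distribution.
    """
    counts = Counter(corpus.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 reserves mass for unseen tokens
    tokens = text.split()
    log_prob = 0.0
    for tok in tokens:
        # Laplace (add-one) smoothing so unseen tokens get nonzero probability.
        p = (counts[tok] + 1) / (total + vocab)
        log_prob += math.log(p)
    # Perplexity = exp of the negative average token log-probability.
    return math.exp(-log_prob / len(tokens))
```

For instance, against a small in-domain corpus, an in-domain phrase scores a lower perplexity than an out-of-domain one, mirroring (in miniature) how attacked papers stand out by elevated perplexity.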
📝 Abstract
The use of large language models (LLMs) in peer review systems has attracted growing attention, making it essential to examine their potential vulnerabilities. Prior attacks rely on prompt injection, which alters manuscript content and conflates injection susceptibility with evaluation robustness. We propose the Paraphrasing Adversarial Attack (PAA), a black-box optimization method that searches for paraphrased sequences yielding higher review scores while preserving semantic equivalence and linguistic naturalness. PAA leverages in-context learning, using previous paraphrases and their scores to guide candidate generation. Experiments across five ML and NLP conferences with three LLM reviewers and five attacking models show that PAA consistently increases review scores without changing the paper's claims. Human evaluation confirms that generated paraphrases maintain meaning and naturalness. We also find that attacked papers exhibit increased perplexity in reviews, offering a potential detection signal, and that paraphrasing submissions can partially mitigate attacks.
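The search loop the abstract describes, proposing paraphrases, scoring them with the reviewer, and feeding (paraphrase, score) pairs back as in-context guidance, can be sketched as a greedy black-box optimization. The `paraphrase` and `review_score` functions below are deterministic toy stand-ins (assumptions for illustration, not the authors' attacking or reviewing models) so the loop itself runs end to end:

```python
import random

def paraphrase(text: str, history: list[tuple[str, float]], rng: random.Random) -> str:
    # Stand-in for the attacking LLM. A real attacker would prompt an LLM
    # with `history` as in-context examples ("these paraphrases received
    # these scores"); this stub just swaps in synonyms at a random position
    # to produce a semantics-preserving variant.
    synonyms = {"good": "strong", "method": "approach", "shows": "demonstrates"}
    words = text.split()
    i = rng.randrange(len(words))
    words[i] = synonyms.get(words[i], words[i])
    return " ".join(words)

def review_score(text: str) -> float:
    # Stand-in for the black-box LLM reviewer: rewards the "stronger"
    # wording the stub paraphraser can introduce.
    return float(sum(text.count(w) for w in ("strong", "approach", "demonstrates")))

def paa_attack(text: str, iterations: int = 20, seed: int = 0) -> tuple[str, float]:
    """Greedy black-box search for a higher-scoring paraphrase."""
    rng = random.Random(seed)
    history = [(text, review_score(text))]
    best_text, best_score = history[0]
    for _ in range(iterations):
        candidate = paraphrase(best_text, history, rng)
        score = review_score(candidate)
        history.append((candidate, score))   # accumulated in-context guidance
        if score > best_score:               # keep only improving paraphrases
            best_text, best_score = candidate, score
    return best_text, best_score
```

The greedy acceptance rule means the returned score never falls below the original submission's score, while the accumulated history is what an in-context attacker would condition on when generating the next candidate.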