Breaking the Reviewer: Assessing the Vulnerability of Large Language Models in Automated Peer Review Under Textual Adversarial Attacks

📅 2025-06-08
🤖 AI Summary
This work presents the first systematic evaluation of large language models’ (LLMs) robustness against textual adversarial attacks in automated peer review. We investigate three attack types—synonym substitution, syntactic perturbation, and semantics-preserving rewriting—across mainstream LLMs (GPT-4, Claude, Llama), benchmarking their behavior against human reviewers. To this end, we propose a novel adversarial evaluation framework tailored to academic review, introducing three reliability dimensions: review consistency, decision stability, and quality sensitivity. Experimental results reveal that minor input perturbations induce decision flips (accept/reject reversal) in 37–62% of cases, exposing critical fragility in LLMs’ scholarly gatekeeping capability. Our study identifies a key security vulnerability in AI-assisted peer review and establishes a reproducible evaluation paradigm—along with concrete mitigation directions—for developing trustworthy, robust LLM-based review systems.
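The decision-flip figure above can be read as a simple ratio: the share of papers whose accept/reject outcome changes once the input text is perturbed. The snippet below is a minimal sketch of that calculation, not the authors' evaluation code. The function name `decision_flip_rate`, the string labels, and the five sample decisions are all hypothetical and chosen only for illustration.

```python
from typing import Sequence


def decision_flip_rate(original: Sequence[str], perturbed: Sequence[str]) -> float:
    """Return the fraction of papers whose decision changed after perturbation.

    Both arguments hold one decision label per paper (for example "accept"
    or "reject"), in the same paper order. This is an illustrative
    definition, assumed here rather than taken from the paper.
    """
    if len(original) != len(perturbed):
        raise ValueError("decision lists must have the same length")
    if not original:
        raise ValueError("need at least one decision pair")
    # Count the papers where the decision before and after the attack differ.
    flips = sum(1 for before, after in zip(original, perturbed) if before != after)
    return flips / len(original)


# Hypothetical decisions for five papers, before and after an attack.
before = ["accept", "reject", "accept", "reject", "accept"]
after = ["reject", "reject", "accept", "accept", "accept"]

rate = decision_flip_rate(before, after)
print(f"flip rate: {rate:.0%}")  # two of five decisions changed
```

A per-attack breakdown would call the same function once for each perturbation type on the same set of papers.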

📝 Abstract
Peer review is essential for maintaining academic quality, but the increasing volume of submissions places a significant burden on reviewers. Large language models (LLMs) offer potential assistance in this process, yet their susceptibility to textual adversarial attacks raises reliability concerns. This paper investigates the robustness of LLMs used as automated reviewers in the presence of such attacks. We focus on three key questions: (1) The effectiveness of LLMs in generating reviews compared to human reviewers. (2) The impact of adversarial attacks on the reliability of LLM-generated reviews. (3) Challenges and potential mitigation strategies for LLM-based review. Our evaluation reveals significant vulnerabilities, as text manipulations can distort LLM assessments. We offer a comprehensive evaluation of LLM performance in automated peer reviewing and analyze its robustness against adversarial attacks. Our findings emphasize the importance of addressing adversarial risks to ensure AI strengthens, rather than compromises, the integrity of scholarly communication.
Problem

Research questions and friction points this paper is trying to address.

Assessing LLM vulnerability in automated peer review under adversarial attacks
Evaluating effectiveness and reliability of LLM-generated reviews versus humans
Exploring mitigation strategies for adversarial risks in AI-based reviewing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Assessing LLM robustness in peer review
Evaluating adversarial attack impacts on reviews
Proposing mitigation for LLM review vulnerabilities
Authors

Tzu-Ling Lin, National Yang Ming Chiao Tung University
Wei-Chih Chen, National Taiwan University (speech processing, diffusion model)
Teng-Fang Hsiao, National Yang Ming Chiao Tung University
Hou-I Liu, NYCU (Computer Vision)
Ya-Hsin Yeh, National Yang Ming Chiao Tung University
Yu Kai Chan, National Yang Ming Chiao Tung University
Wen-Sheng Lien, National Yang Ming Chiao Tung University
Po-Yen Kuo, National Yang Ming Chiao Tung University
Philip S. Yu, Professor of Computer Science, University of Illinois at Chicago (Data Mining, Database, Privacy)
Hong-Han Shuai, National Yang Ming Chiao Tung University (Deep Learning, Data Mining, Multimedia Processing)