When Reject Turns into Accept: Quantifying the Vulnerability of LLM-Based Scientific Reviewers to Indirect Prompt Injection

📅 2025-12-11

📈 Citations: 0

✨ Influential: 0

career value

225K/year

🤖 AI Summary

This study systematically investigates the robustness of LLM-based scientific peer review systems—including both illicit and legitimate platforms—against PDF-level indirect prompt injection attacks, focusing on a novel security threat: malicious reversal of editorial decisions from “rejection” to “acceptance.” Method: We propose WAVS (Weighted Adversarial Vulnerability Score), a new metric quantifying model susceptibility; construct the first academic-review-specific adversarial dataset comprising 200 papers; and design 15 domain-adapted, PDF-native indirect injection strategies—including semantic obfuscation and steganographic techniques. Contribution/Results: We evaluate 13 mainstream models (e.g., GPT-5, Claude Haiku, DeepSeek), achieving high decision-reversal rates. All data and the injection framework are publicly released, establishing a benchmark resource and methodological foundation for AI-assisted peer review security research.

Technology Category

Application Category

📝 Abstract

The landscape of scientific peer review is rapidly evolving with the integration of Large Language Models (LLMs). This shift is driven by two parallel trends: the widespread individual adoption of LLMs by reviewers to manage workload (the "Lazy Reviewer" hypothesis) and the formal institutional deployment of AI-powered assessment systems by conferences like AAAI and Stanford's Agents4Science. This study investigates the robustness of these "LLM-as-a-Judge" systems (both illicit and sanctioned) to adversarial PDF manipulation. Unlike general jailbreaks, we focus on a distinct incentive: flipping "Reject" decisions to "Accept," for which we develop a novel evaluation metric which we term as WAVS (Weighted Adversarial Vulnerability Score). We curated a dataset of 200 scientific papers and adapted 15 domain-specific attack strategies to this task, evaluating them across 13 Language Models, including GPT-5, Claude Haiku, and DeepSeek. Our results demonstrate that obfuscation strategies like "Maximum Mark Magyk" successfully manipulate scores, achieving alarming decision flip rates even in large-scale models. We will release our complete dataset and injection framework to facilitate more research on this topic.

Problem

Research questions and friction points this paper is trying to address.

Assesses LLM-based peer review systems' vulnerability to adversarial PDF manipulation.

Develops a novel metric to quantify success in flipping reject decisions to accept.

Evaluates domain-specific attack strategies across multiple large language models.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Adversarial PDF manipulation to flip decisions

Novel WAVS metric for vulnerability quantification

Domain-specific attack strategies across multiple models

🔎 Similar Papers

No similar papers found.