Prompt Injection Attacks on LLM Generated Reviews of Scientific Publications

📅 2025-09-12

📈 Citations: 0

✨ Influential: 0

career value

202K/year

🤖 AI Summary

This work identifies a critical vulnerability: large language models (LLMs) employed in scientific paper peer review are highly susceptible to prompt injection attacks. Using 1,000 real ICLR 2024 review reports as test cases, the study systematically evaluates the sensitivity of multiple state-of-the-art LLMs to stealthy prompt injections. Results demonstrate that even simple, syntactically unobtrusive injections achieve near-perfect malicious instruction acceptance rates (~100%) across all tested models; moreover, every model exhibits pronounced directional bias (average >95%), causing systematic deviation from objective evaluation criteria. This is the first empirical study to rigorously document that LLM-based reviewers are manipulable, lacking robustness and fairness under adversarial conditions. The findings provide foundational evidence and urgent caution for the trustworthy deployment of AI-assisted academic evaluation systems, establishing a benchmark for assessing reliability and integrity in automated peer review.

Technology Category

Application Category

📝 Abstract

The ongoing intense discussion on rising LLM usage in the scientific peer-review process has recently been mingled by reports of authors using hidden prompt injections to manipulate review scores. Since the existence of such "attacks" - although seen by some commentators as "self-defense" - would have a great impact on the further debate, this paper investigates the practicability and technical success of the described manipulations. Our systematic evaluation uses 1k reviews of 2024 ICLR papers generated by a wide range of LLMs shows two distinct results: I) very simple prompt injections are indeed highly effective, reaching up to 100% acceptance scores. II) LLM reviews are generally biased toward acceptance (>95% in many models). Both results have great impact on the ongoing discussions on LLM usage in peer-review.

Problem

Research questions and friction points this paper is trying to address.

Investigates prompt injection attacks manipulating LLM review scores

Evaluates practicability and success of hidden prompt injections

Examines LLM bias toward paper acceptance in peer-review

Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematic evaluation using LLM-generated reviews

Testing simple prompt injection effectiveness

Measuring bias in automated review scores

🔎 Similar Papers

Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models