Publish to Perish: Prompt Injection Attacks on LLM-Assisted Peer Review

📅 2025-08-28

📈 Citations: 0

✨ Influential: 0

career value

216K/year

🤖 AI Summary

This paper identifies a novel, stealthy threat to LLM-based peer review systems: authors can embed human-imperceptible adversarial steganographic text within PDF manuscripts to bias LLM-generated review recommendations. Method: We formally define three motivation-driven threat models and design a steganographic prompt injection attack with cross-model, cross-prompt, and cross-paper generalizability. Leveraging a user study, we construct realistic review prompt templates and empirically validate the attack on mainstream commercial LLMs. We further propose evasion strategies that bypass existing automated detection mechanisms. Contribution/Results: Our attack reliably manipulates LLM outputs across diverse settings, posing a tangible risk to reviewers relying on automated assistance. The findings underscore urgent needs for enhanced robustness evaluation and security auditing of LLMs in high-stakes academic applications.

Technology Category

Application Category

📝 Abstract

Large Language Models (LLMs) are increasingly being integrated into the scientific peer-review process, raising new questions about their reliability and resilience to manipulation. In this work, we investigate the potential for hidden prompt injection attacks, where authors embed adversarial text within a paper's PDF to influence the LLM-generated review. We begin by formalising three distinct threat models that envision attackers with different motivations -- not all of which implying malicious intent. For each threat model, we design adversarial prompts that remain invisible to human readers yet can steer an LLM's output toward the author's desired outcome. Using a user study with domain scholars, we derive four representative reviewing prompts used to elicit peer reviews from LLMs. We then evaluate the robustness of our adversarial prompts across (i) different reviewing prompts, (ii) different commercial LLM-based systems, and (iii) different peer-reviewed papers. Our results show that adversarial prompts can reliably mislead the LLM, sometimes in ways that adversely affect a "honest-but-lazy" reviewer. Finally, we propose and empirically assess methods to reduce detectability of adversarial prompts under automated content checks.

Problem

Research questions and friction points this paper is trying to address.

Investigating hidden prompt injection attacks on LLM-assisted peer review

Assessing adversarial text manipulation in scientific paper PDFs

Evaluating LLM vulnerability to stealthy author influence attempts

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hidden prompt injection attacks on PDFs

Design adversarial prompts invisible to humans

Reduce detectability under automated content checks

🔎 Similar Papers

No similar papers found.