When Your Reviewer is an LLM: Biases, Divergence, and Prompt Injection Risks in Peer Review

📅 2025-09-11

📈 Citations: 0

✨ Influential: 0

career value

215K/year

🤖 AI Summary

This work systematically evaluates the reliability and risks of large language models (LLMs) as academic peer review “assistants,” focusing on fairness, consistency, and robustness against adversarial prompt injection attacks. Method: We introduce, for the first time, a PDF-level indirect prompt injection technique—integrating structured prompting, reference-paper calibration, topic modeling, and semantic similarity analysis—to empirically assess models including GPT-5-mini. Contribution/Results: Results show that while LLMs assign scores to high-quality papers closely aligned with human reviewers, they exhibit score inflation for weak papers. Critically, their evaluations are vulnerable to domain-specific covert instructions, causing misalignment in review focus. Inter-rater agreement with human reviewers is strongly contingent on paper quality. This study uncovers fundamental vulnerabilities in LLM-assisted peer review, providing critical empirical evidence and a methodological foundation for developing trustworthy, AI-augmented scholarly evaluation systems.

Technology Category

Application Category

📝 Abstract

Peer review is the cornerstone of academic publishing, yet the process is increasingly strained by rising submission volumes, reviewer overload, and expertise mismatches. Large language models (LLMs) are now being used as "reviewer aids," raising concerns about their fairness, consistency, and robustness against indirect prompt injection attacks. This paper presents a systematic evaluation of LLMs as academic reviewers. Using a curated dataset of 1,441 papers from ICLR 2023 and NeurIPS 2022, we evaluate GPT-5-mini against human reviewers across ratings, strengths, and weaknesses. The evaluation employs structured prompting with reference paper calibration, topic modeling, and similarity analysis to compare review content. We further embed covert instructions into PDF submissions to assess LLMs' susceptibility to prompt injection. Our findings show that LLMs consistently inflate ratings for weaker papers while aligning more closely with human judgments on stronger contributions. Moreover, while overarching malicious prompts induce only minor shifts in topical focus, explicitly field-specific instructions successfully manipulate specific aspects of LLM-generated reviews. This study underscores both the promises and perils of integrating LLMs into peer review and points to the importance of designing safeguards that ensure integrity and trust in future review processes.

Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM bias and rating alignment in peer review

Assessing LLM vulnerability to indirect prompt injection attacks

Analyzing LLM-human reviewer divergence across paper strengths

Innovation

Methods, ideas, or system contributions that make the work stand out.

Structured prompting with reference calibration

Embedding covert instructions in PDFs

Topic modeling and similarity analysis

🔎 Similar Papers

No similar papers found.