When AI reviews science: Can we trust the referee?

📅 2026-04-26

🤖 AI Summary

This study addresses reliability concerns in AI-assisted peer review by systematically analyzing security risks such as prompt injection, adversarial phrasing, authority and length biases, and hallucinations. It proposes an attack taxonomy spanning the review lifecycle: training and data retrieval, desk review, deep review, rebuttal, and system-level attacks. Using a stratified sample of ICLR 2025 submissions and two advanced LLM-based referees, the authors run four treatment-control probes that isolate the causal effects of prestige framing, assertion strength, rebuttal sycophancy, and contextual poisoning on review scores. The result is an evidence-based baseline for assessing and tracking the reliability of AI peer review, with concrete failure points to guide targeted, testable mitigations.

📝 Abstract

The volume of scientific submissions continues to climb, outpacing the capacity of qualified human referees and stretching editorial timelines. At the same time, modern large language models (LLMs) offer impressive capabilities in summarization, fact checking, and literature triage, making the integration of AI into peer review increasingly attractive -- and, in practice, unavoidable. Yet early deployments and informal adoption have exposed acute failure modes. Recent incidents have revealed that hidden prompt injections embedded in manuscripts can steer LLM-generated reviews toward unjustifiably positive judgments. Complementary studies have also demonstrated brittleness to adversarial phrasing, authority and length biases, and hallucinated claims. These episodes raise a central question for scholarly communication: when AI reviews science, can we trust the AI referee? This paper provides a security- and reliability-centered analysis of AI peer review. We map attacks across the review lifecycle -- training and data retrieval, desk review, deep review, rebuttal, and system-level. We instantiate this taxonomy with four treatment-control probes on a stratified set of ICLR 2025 submissions, using two advanced LLM-based referees to isolate the causal effects of prestige framing, assertion strength, rebuttal sycophancy, and contextual poisoning on review scores. Together, this taxonomy and experimental audit provide an evidence-based baseline for assessing and tracking the reliability of AI peer review and highlight concrete failure points to guide targeted, testable mitigations.
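A treatment-control probe of the kind the abstract describes can be sketched as a paired comparison: each paper is scored once unchanged and once with a single manipulation applied, and the per-paper score differences are averaged. The Python below is a minimal illustration under stated assumptions, not the authors' pipeline. The names `paired_effect`, `toy_referee`, and `add_prestige_cue` are hypothetical, the referee is a deterministic toy stand-in for an LLM call, and the bootstrap interval is one common choice that the abstract does not specify.

```python
import random
import statistics
from typing import Callable, Sequence, Tuple


def paired_effect(
    papers: Sequence[str],
    score: Callable[[str], float],
    treat: Callable[[str], str],
    n_boot: int = 2000,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Mean treated-minus-control score difference with a bootstrap 95% interval."""
    # One paired difference per paper: same text, with and without the manipulation.
    diffs = [score(treat(p)) - score(p) for p in papers]
    rng = random.Random(seed)
    # Resample the paired differences with replacement and record each mean.
    boots = sorted(
        statistics.fmean(rng.choices(diffs, k=len(diffs))) for _ in range(n_boot)
    )
    lower = boots[int(0.025 * n_boot)]
    upper = boots[int(0.975 * n_boot) - 1]
    return statistics.fmean(diffs), lower, upper


# Hypothetical stand-ins (assumptions for illustration only).
def toy_referee(text: str) -> float:
    """Toy scorer that is deliberately biased by a prestige cue."""
    base = 5.0 + text.count("novel")  # content-dependent part of the score
    bonus = 1.0 if "From a leading lab." in text else 0.0  # the bias under test
    return base + bonus


def add_prestige_cue(text: str) -> str:
    """Treatment: prepend a prestige statement, leaving the content unchanged."""
    return "From a leading lab. " + text


PAPERS = [
    "We study a novel optimizer.",
    "A benchmark for retrieval.",
    "A novel and novel-adjacent analysis of attention.",
]

if __name__ == "__main__":
    effect, lower, upper = paired_effect(PAPERS, toy_referee, add_prestige_cue)
    print(f"estimated effect: {effect:.2f} (95% interval {lower:.2f} to {upper:.2f})")
```

Pairing each treated manuscript with its own control keeps paper quality constant within a pair, so a nonzero mean difference is attributable to the manipulation rather than to differences between papers. With a real LLM referee the scores would be stochastic, so repeated scoring per condition would likely be needed.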

Problem

Research questions and friction points this paper is trying to address.

AI peer review

trustworthiness

adversarial attacks

bias

hallucination

Innovation

Methods, ideas, or system contributions that make the work stand out.

AI peer review

large language models

adversarial robustness

reliability evaluation

security taxonomy
