Do LLMs Favor LLMs? Quantifying Interaction Effects in Peer Review

📅 2026-01-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether large language models (LLMs) exhibit a preference for LLM-generated papers during peer review and examines the interaction between LLM-assisted reviewing and paper quality. Leveraging over 125,000 paper–review pairs from ICLR, NeurIPS, and ICML, the work combines large-scale observational analysis, controlled regression, simulated LLM-generated reviews, and meta-reviewer behavior modeling. It reveals that the apparent "LLM preference" is attributable to the over-representation of LLM-assisted papers among low-quality submissions rather than to inherent favoritism. The findings show that LLM-augmented human reviewers mitigate leniency bias and improve discriminative power, whereas fully LLM-generated reviews suffer from severe score compression. Moreover, during meta-reviewing, human reviewers using LLM assistance are more inclined to accept papers, while fully automated LLM decisions adopt a stricter stance.

📝 Abstract
There are increasing indications that LLMs are not only used for producing scientific papers, but also as part of the peer review process. In this work, we provide the first comprehensive analysis of LLM use across the peer review pipeline, with particular attention to interaction effects: not just whether LLM-assisted papers or LLM-assisted reviews are different in isolation, but whether LLM-assisted reviews evaluate LLM-assisted papers differently. In particular, we analyze over 125,000 paper-review pairs from ICLR, NeurIPS, and ICML. We initially observe what appears to be a systematic interaction effect: LLM-assisted reviews seem especially kind to LLM-assisted papers compared to papers with minimal LLM use. However, controlling for paper quality reveals a different story: LLM-assisted reviews are simply more lenient toward lower quality papers in general, and the over-representation of LLM-assisted papers among weaker submissions creates a spurious interaction effect rather than genuine preferential treatment of LLM-generated content. By augmenting our observational findings with reviews that are fully LLM-generated, we find that fully LLM-generated reviews exhibit severe rating compression that fails to discriminate paper quality, while human reviewers using LLMs substantially reduce this leniency. Finally, examining metareviews, we find that LLM-assisted metareviews are more likely to render accept decisions than human metareviews given equivalent reviewer scores, though fully LLM-generated metareviews tend to be harsher. This suggests that meta-reviewers do not merely outsource the decision-making to the LLM. These findings provide important input for developing policies that govern the use of LLMs during peer review, and they more generally indicate how LLMs interact with existing decision-making processes.
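The abstract's central statistical point, that a paper-LLM × review-LLM interaction can appear in a naive regression and vanish once paper quality is controlled for, can be illustrated with a small simulation. The sketch below is hypothetical (it does not use the paper's data, and all variable names and coefficients are invented for illustration): LLM-assisted papers are made more common among low-quality submissions and LLM-assisted reviews are made lenient toward low quality generally, with no true interaction built in. Fitting the naive model nonetheless recovers a positive interaction coefficient, while the quality-controlled model drives it toward zero.

```python
import numpy as np

# Hypothetical simulation (not the paper's data) of a spurious
# paper-LLM x review-LLM interaction effect.
rng = np.random.default_rng(0)
n = 20_000

quality = rng.standard_normal(n)  # latent paper quality
# LLM-assisted papers are over-represented among weaker submissions
llm_paper = (rng.random(n) < 1 / (1 + np.exp(1.5 * quality))).astype(float)
llm_review = (rng.random(n) < 0.5).astype(float)  # reviewer used an LLM

# True score model: LLM-assisted reviews are lenient toward LOW quality;
# there is NO genuine paper x review interaction term.
score = quality + llm_review * (0.5 - 0.4 * quality) \
        + 0.3 * rng.standard_normal(n)

def ols(columns, y):
    """Least-squares coefficients, with an intercept column prepended."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Naive model: without a quality control, the interaction looks real.
naive = ols([llm_paper, llm_review, llm_paper * llm_review], score)
# Controlled model: adding quality and a quality x review leniency term
# absorbs the effect, and the interaction coefficient collapses.
ctrl = ols([llm_paper, llm_review, quality, llm_review * quality,
            llm_paper * llm_review], score)

print(f"naive interaction coefficient:      {naive[3]:+.3f}")
print(f"controlled interaction coefficient: {ctrl[5]:+.3f}")
```

The naive interaction coefficient is positive only because the paper-LLM indicator proxies for low quality, which is exactly the confounding structure the abstract describes.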
Problem

Research questions and friction points this paper is trying to address.

LLM
peer review
interaction effects
scientific publishing
review bias
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-assisted peer review
interaction effects
rating compression
paper quality control
metareview decision-making