🤖 AI Summary
This study identifies fairness risks in academic peer review arising from large language models (LLMs): when LLMs serve concurrently as paper authors and reviewers, they exhibit systematic scoring bias, overrating LLM-generated manuscripts while underrating human-authored papers that contain critical statements. To investigate this, we introduce the first LLM-based research agent–reviewer agent co-simulation framework, integrating multi-round generation–revision–review loops with human-annotated ground-truth validation. Our empirical analysis is the first to identify two distinct biases in LLM review behavior: (1) a linguistic feature preference bias (e.g., favoring fluency, length, and syntactic complexity) and (2) an aversion to critical discourse, particularly discussion of methodological limitations or contradictory evidence. Although these biases undermine review fairness and equity, LLM-generated feedback remains effective at improving manuscript quality, especially for early-career researchers and low-quality submissions.
📝 Abstract
The rapid advancement of large language models (LLMs) has inspired researchers to integrate them extensively into the academic workflow, potentially reshaping how research is practiced and reviewed. While previous studies highlight the potential of LLMs in supporting research and peer review, their dual roles in the academic workflow and the complex interplay between research and review bring new risks that remain largely underexplored. In this study, we focus on how the deep integration of LLMs into both the peer-review and research processes may influence scholarly fairness, examining the potential risks of using LLMs as reviewers through simulation. The simulation incorporates a research agent, which generates and revises papers, alongside a review agent, which assesses the submissions. Based on the simulation results, we conduct human annotations and identify a pronounced misalignment between LLM-based reviews and human judgments: (1) LLM reviewers systematically inflate scores for LLM-authored papers, assigning them markedly higher scores than human-authored ones; (2) LLM reviewers persistently underrate human-authored papers containing critical statements (e.g., on risk or fairness), even after multiple revisions. Our analysis reveals that these misalignments stem from two primary biases in LLM reviewers: a linguistic feature bias favoring LLM-generated writing styles, and an aversion toward critical statements. These results highlight the risks and equity concerns posed to human authors and academic research if LLMs are deployed in the peer-review cycle without adequate caution. On the other hand, revisions guided by LLM reviews yield quality gains in both LLM-based and human evaluations, illustrating the potential of LLMs-as-reviewers to support early-stage researchers and improve low-quality papers.
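To make the described co-simulation concrete, below is a minimal, hypothetical sketch of a multi-round generation–revision–review loop. The agent functions (`generate_paper`, `review_paper`, `revise_paper`) are stubs standing in for LLM calls, and all names, data structures, and the number of rounds are illustrative assumptions, not the paper's actual implementation, prompts, or scoring rubric.

```python
# Illustrative sketch (not the paper's code): a research agent drafts and
# revises a manuscript while a review agent scores it over several rounds.
# The LLM calls are stubbed out with placeholder logic.

from dataclasses import dataclass, field
from typing import List


@dataclass
class Review:
    score: float   # e.g., an overall rating assigned by the review agent
    comments: str  # free-text feedback used to guide the next revision


@dataclass
class Paper:
    text: str
    history: List[Review] = field(default_factory=list)


def generate_paper(topic: str) -> Paper:
    """Research agent: draft an initial manuscript (LLM call stubbed out)."""
    return Paper(text=f"Draft manuscript on: {topic}")


def review_paper(paper: Paper) -> Review:
    """Review agent: score the manuscript and return feedback (LLM call stubbed out)."""
    return Review(score=5.0, comments="Clarify methodology; discuss limitations.")


def revise_paper(paper: Paper, review: Review) -> Paper:
    """Research agent: revise the manuscript according to reviewer feedback."""
    revised_text = paper.text + f"\n[Revised to address: {review.comments}]"
    return Paper(text=revised_text, history=paper.history + [review])


def co_simulate(topic: str, num_rounds: int = 3) -> Paper:
    """Run the multi-round generation-revision-review loop."""
    paper = generate_paper(topic)
    for _ in range(num_rounds):
        review = review_paper(paper)
        paper = revise_paper(paper, review)
    return paper


if __name__ == "__main__":
    final = co_simulate("Fairness risks of LLMs in peer review")
    print(final.text)
    print("Review scores per round:", [r.score for r in final.history])
```

In such a setup, the per-round scores and review texts are what would later be compared against human annotations to detect the misalignments discussed above.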