Can Large Language Models Be Trusted Paper Reviewers? A Feasibility Study

📅 2025-06-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Can large language models (LLMs) effectively perform academic paper reviewing? This work presents the first systematic, end-to-end evaluation of LLMs’ reviewing capabilities on real conference submissions (WASA 2024). We propose an automated reviewing system integrating retrieval-augmented generation (RAG), the AutoGen multi-agent framework, and chain-of-thought prompting to support format checking, standardized scoring, and structured review generation. Experiments show the system requires an average of 2.48 hours per paper at a cost of $104.28. Crucially, LLM-recommended acceptances align with actual acceptances in only 38.6% of cases—substantially lower than human reviewer agreement—demonstrating that LLMs lack autonomous decision-making capability. Our core contribution is establishing a reproducible LLM-assisted reviewing paradigm and empirically validating “human-AI collaboration” as the optimal current approach: LLMs serve best as efficient, controllable auxiliary tools—not replacements—for human reviewers.
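The summary names the system's components (RAG, AutoGen, chain-of-thought prompting) but the paper's implementation is not included on this page. As a rough illustration only, the following is a minimal sketch of how a two-agent AutoGen review loop with chain-of-thought style instructions could be wired up; the agent roles, prompts, score scale, and gpt-4o configuration are assumptions, not the authors' actual setup.

```python
# Minimal sketch (not the authors' released code) of a two-agent AutoGen review
# loop with chain-of-thought style instructions. Assumes the legacy `pyautogen`
# (v0.2) API and an OPENAI_API_KEY in the environment; agent names, prompts,
# and the score scale are illustrative assumptions.
import os
import autogen

llm_config = {
    "config_list": [{"model": "gpt-4o", "api_key": os.environ["OPENAI_API_KEY"]}]
}

# Reviewer agent: the system message forces step-by-step reasoning (format check,
# contribution summary, quality assessment) before scores and a recommendation.
reviewer = autogen.AssistantAgent(
    name="reviewer",
    system_message=(
        "You are a conference reviewer. Reason step by step: 1) check formatting, "
        "2) summarize the contributions, 3) assess novelty, soundness, and clarity, "
        "4) only then give 1-5 scores and an accept/reject recommendation as a "
        "structured review."
    ),
    llm_config=llm_config,
)

# Meta-reviewer agent: critiques the first review and issues the final verdict.
meta_reviewer = autogen.AssistantAgent(
    name="meta_reviewer",
    system_message=(
        "You are the meta-reviewer. Flag unsupported claims or inconsistent "
        "scores in the review you receive, then produce a final decision."
    ),
    llm_config=llm_config,
)

# Proxy that drives the conversation; no human input, no code execution.
chair = autogen.UserProxyAgent(
    name="chair",
    human_input_mode="NEVER",
    max_consecutive_auto_reply=0,
    code_execution_config=False,
)

def review_paper(paper_text: str) -> str:
    """Run one submission through reviewer -> meta-reviewer and return the verdict."""
    first = chair.initiate_chat(
        reviewer, message=f"Review the following submission:\n\n{paper_text}"
    )
    second = chair.initiate_chat(
        meta_reviewer,
        message=f"Assess this review and give a final decision:\n\n{first.summary}",
    )
    return second.summary
```

Splitting reviewing and meta-reviewing across two agents is only one way to use AutoGen's multi-agent support; the agent topology in the paper may differ.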

📝 Abstract
Academic paper review typically requires substantial time, expertise, and human resources. Large Language Models (LLMs) present a promising method for automating the review process due to their extensive training data, broad knowledge base, and relatively low usage cost. This work explores the feasibility of using LLMs for academic paper review by proposing an automated review system. The system integrates Retrieval Augmented Generation (RAG), the AutoGen multi-agent system, and Chain-of-Thought prompting to support tasks such as format checking, standardized evaluation, comment generation, and scoring. Experiments conducted on 290 submissions from the WASA 2024 conference using GPT-4o show that LLM-based review significantly reduces review time (average 2.48 hours) and cost (average $104.28 USD). However, the similarity between LLM-selected papers and actual accepted papers remains low (average 38.6%), indicating issues such as hallucination, lack of independent judgment, and retrieval preferences. Therefore, it is recommended to use LLMs as assistive tools to support human reviewers, rather than to replace them.
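The abstract lists Retrieval Augmented Generation as a component without spelling out the retrieval step. The snippet below is a generic sketch of how retrieved reference passages could be folded into a review prompt; the embedding model (text-embedding-3-small), the top-k cutoff, and the helper names are illustrative assumptions rather than details from the paper.

```python
# Hypothetical sketch of the RAG step: retrieve the most relevant reference
# passages for a submission and prepend them to the review prompt. Uses the
# OpenAI embeddings API; the model name and k=3 cutoff are illustrative.
from openai import OpenAI
import numpy as np

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts: list[str]) -> np.ndarray:
    """Embed a batch of texts into unit-length vectors."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    vecs = np.array([d.embedding for d in resp.data])
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def retrieve(query: str, corpus: list[str], k: int = 3) -> list[str]:
    """Return the k corpus passages most similar to the query (cosine similarity)."""
    corpus_vecs = embed(corpus)          # in practice this would be precomputed
    query_vec = embed([query])[0]
    scores = corpus_vecs @ query_vec
    top = np.argsort(scores)[::-1][:k]
    return [corpus[i] for i in top]

def build_review_prompt(paper_text: str, corpus: list[str]) -> str:
    """Ground the review request in retrieved context (the RAG part)."""
    context = "\n\n".join(retrieve(paper_text[:2000], corpus))
    return (
        "Related material retrieved for context:\n"
        f"{context}\n\n"
        "Using the context above only where relevant, review this submission:\n"
        f"{paper_text}"
    )
```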
Problem

Research questions and friction points this paper is trying to address.

Exploring the feasibility of using LLMs to automate academic paper review
Assessing the accuracy of LLMs in selecting and evaluating research papers
Proposing an LLM-assisted review system to reduce review time and cost
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Retrieval-Augmented Generation (RAG) for review
Integrates the AutoGen multi-agent framework
Applies Chain-of-Thought prompting
🔎 Similar Papers
No similar papers found.
Chuanlei Li
Shandong University
Decentralized Storage Network · Blockchain

Xu Hu
Department of Computer Science, University of Texas at Dallas

Minghui Xu
School of Computer Science and Technology, Shandong University

Kun Li
School of Computer Science and Technology, Shandong University

Yue Zhang
School of Computer Science and Technology, Shandong University

Xiuzhen Cheng
School of Computer Science and Technology, Shandong University
Blockchain · IoT Security · Edge Computing · Distributed Computing