From Replication to Redesign: Exploring Pairwise Comparisons for LLM-Based Peer Review

📅 2025-06-12
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Traditional peer review treats large language models (LLMs) merely as substitutes for human reviewers, failing to leverage their distinctive capabilities for systematic evaluation. Method: The paper proposes a pairwise comparison-based redesign of the review paradigm, implemented as a multi-round, LLM-agent-driven framework that assesses manuscripts head to head rather than scoring them in isolation. The framework integrates controllable prompt engineering, vote aggregation, and bias-calibration mechanisms to produce robust relative quality rankings. Contribution/Results: Rather than treating LLMs as drop-in replacements for human reviewers, this work embeds them as architectural components of a redesigned review process. Empirical evaluation demonstrates that the comparative approach significantly outperforms traditional rating-based methods in identifying high-impact papers. The analysis also uncovers and quantifies emergent biases in the resulting selections, notably reduced topical novelty and increased institutional imbalance, highlighting both the promise of LLM-native review mechanisms and the equity challenges future systems must address.
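
As a concrete illustration of the mechanism sketched above (multi-round comparisons, vote aggregation, order-bias calibration), here is a minimal Python sketch of one plausible shape for the comparison loop. This is an assumption-laden sketch, not the authors' framework: `llm_vote` is a hypothetical stand-in for an actual LLM call, the prompt text is invented, and position-swapping is just one common way to calibrate away positional bias.

```python
# Hypothetical sketch of a multi-round pairwise comparison loop with
# position-swap debiasing and majority-vote aggregation. Not the paper's
# actual implementation; `llm_vote` stands in for a real LLM call.
import itertools
import random
from collections import Counter

def llm_vote(prompt: str) -> str:
    # Stub: a real system would query an LLM agent here and parse "A"/"B".
    return random.choice(["A", "B"])

def compare(paper_a: str, paper_b: str, rounds: int = 5) -> str:
    """Compare two manuscripts over several rounds; an odd number of
    rounds guarantees a majority winner."""
    votes = Counter()
    for r in range(rounds):
        # Swap presentation order on alternating rounds to control for
        # the tendency of LLM judges to favor the first-listed item.
        first, second = (paper_a, paper_b) if r % 2 == 0 else (paper_b, paper_a)
        prompt = f"Which manuscript is stronger?\n[A] {first}\n[B] {second}"
        raw = llm_vote(prompt)
        # Map the positional answer back to the actual paper.
        winner = first if raw == "A" else second
        votes[winner] += 1
    return votes.most_common(1)[0][0]

# Run all pairwise comparisons over a small pool of manuscripts.
papers = ["paper_1.pdf", "paper_2.pdf", "paper_3.pdf"]
wins = Counter(compare(a, b) for a, b in itertools.combinations(papers, 2))
print(wins)  # pairwise outcomes, ready for a ranking aggregator (see below)
```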

📝 Abstract
The advent of large language models (LLMs) offers unprecedented opportunities to reimagine peer review beyond the constraints of traditional workflows. Despite these opportunities, prior efforts have largely focused on replicating traditional review workflows with LLMs serving as direct substitutes for human reviewers, while limited attention has been given to new paradigms that fundamentally rethink how LLMs can participate in the academic review process. In this paper, we introduce and explore a novel mechanism that employs LLM agents to perform pairwise comparisons among manuscripts instead of individual scoring. By aggregating the outcomes of a large number of pairwise evaluations, this approach enables a more accurate and robust measure of relative manuscript quality. Our experiments demonstrate that this comparative approach significantly outperforms traditional rating-based methods in identifying high-impact papers. However, our analysis also reveals emergent biases in the selection process, notably reduced novelty in the selected research topics and increased institutional imbalance. These findings highlight both the transformative potential of rethinking peer review with LLMs and the critical challenges that future systems must address to ensure equity and diversity.
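
The abstract does not specify how pairwise outcomes are aggregated into a ranking. A standard choice for this step is the Bradley-Terry model, which fits a latent "strength" to each item from pairwise win counts; the sketch below is a minimal illustration under that assumption, not the authors' method, and the vote counts in the example are made up.

```python
# Minimal Bradley-Terry aggregation sketch (assumed, not from the paper):
# converts pairwise win counts into latent quality scores whose ordering
# yields a relative ranking, via the classic Zermelo/MM iteration.
from collections import defaultdict

def bradley_terry(wins, items, iters=200):
    """wins[(a, b)] = number of times a beat b in pairwise comparisons."""
    n = defaultdict(int)           # n[(a, b)]: total a-vs-b comparisons
    total_wins = defaultdict(int)  # total wins per item
    for (a, b), w in wins.items():
        n[(a, b)] += w
        n[(b, a)] += w
        total_wins[a] += w
    p = {i: 1.0 for i in items}  # initial strengths
    for _ in range(iters):
        new_p = {}
        for i in items:
            # MM update: p_i <- W_i / sum_j n_ij / (p_i + p_j)
            denom = sum(n[(i, j)] / (p[i] + p[j])
                        for j in items if j != i and n[(i, j)])
            new_p[i] = total_wins[i] / denom if denom else p[i]
        s = sum(new_p.values())
        p = {i: v / s for i, v in new_p.items()}  # normalize each round
    return p

# Example: hypothetical aggregated LLM votes over three manuscripts.
votes = {("A", "B"): 7, ("B", "A"): 3, ("A", "C"): 8,
         ("C", "A"): 2, ("B", "C"): 6, ("C", "B"): 4}
scores = bradley_terry(votes, ["A", "B", "C"])
print(sorted(scores, key=scores.get, reverse=True))  # ranking: A > B > C
```

Unlike a simple win-rate tally, a Bradley-Terry fit accounts for the strength of each item's opponents, which matters when not every pair is compared equally often.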
Problem

Research questions and friction points this paper is trying to address.

Exploring LLM-based pairwise comparisons for peer review
Measuring relative manuscript quality more accurately than individual scoring
Identifying and addressing emergent biases in LLM-driven paper selection
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM agents perform pairwise manuscript comparisons instead of individual scoring
Aggregates many comparison outcomes into robust relative quality rankings
Outperforms traditional rating-based methods at identifying high-impact papers