PeerRank: Autonomous LLM Evaluation Through Web-Grounded, Bias-Controlled Peer Review

📅 2026-02-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current evaluations of large language models rely heavily on human-curated benchmarks and fixed reference answers, which struggle to capture open-domain, real-world scenarios that require up-to-date information retrieval. This work proposes PeerRank, the first fully autonomous multi-agent peer evaluation framework, which jointly generates tasks, performs category-constrained real-time web search, conducts debiased peer reviews, and aggregates dense scores, all without human supervision or ground-truth references. By modeling evaluation as a symmetric, dynamic multi-role interaction in which every model serves as task designer, respondent, and evaluator, PeerRank produces stable and discriminative rankings across 12 commercial models and 420 synthetically generated questions. The framework uncovers measurable identity and presentation biases, and its peer scores correlate significantly with objective accuracy on established benchmarks such as TruthfulQA and GSM8K.
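
To make the symmetric multi-role process concrete, here is a minimal sketch of one evaluation round, assuming a simplified agent interface. The class and method names (`Agent`, `generate_task`, `answer`, `judge`) are hypothetical stand-ins; the paper's actual prompts, web-search tooling, and scoring rubric are not reproduced here.

```python
# Hypothetical sketch of one PeerRank round: every agent acts as task
# designer, respondent, and evaluator. Bias controls shown: judges never
# score their own answers, and answer order is shuffled before judging.
import random
from statistics import mean

class Agent:
    """Stand-in for an LLM that can author tasks, answer, and judge."""
    def __init__(self, name: str):
        self.name = name

    def generate_task(self) -> str:
        return f"Question authored by {self.name}"

    def answer(self, task: str) -> str:
        # In PeerRank, answering may use category-scoped live web search;
        # here we just return a placeholder string.
        return f"{self.name}'s answer to: {task}"

    def judge(self, task: str, answer: str) -> float:
        # Dense score in [0, 1); a real judge would apply a rubric.
        return random.random()

def peer_rank_round(agents: list[Agent]) -> dict[str, float]:
    """Run one symmetric round and return mean peer score per model."""
    scores: dict[str, list[float]] = {a.name: [] for a in agents}
    for designer in agents:
        task = designer.generate_task()
        answers = {a.name: a.answer(task) for a in agents}
        order = list(answers.items())
        random.shuffle(order)  # mitigate presentation-order bias
        for judge in agents:
            for author, text in order:
                if author == judge.name:
                    continue  # exclude self-evaluation
                scores[author].append(judge.judge(task, text))
    return {name: mean(vals) for name, vals in scores.items()}

if __name__ == "__main__":
    ranking = peer_rank_round([Agent(f"model_{i}") for i in range(4)])
    for name, score in sorted(ranking.items(), key=lambda kv: -kv[1]):
        print(f"{name}: {score:.3f}")
```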

📝 Abstract
Evaluating large language models typically relies on human-authored benchmarks, reference answers, and human or single-model judgments: approaches that scale poorly, become quickly outdated, and mismatch open-world deployments that depend on web retrieval and synthesis. We introduce PeerRank, a fully autonomous end-to-end evaluation framework in which models generate evaluation tasks, answer them with category-scoped live web grounding, judge peer responses, and aggregate dense peer assessments into relative performance estimates, all without human supervision or gold references. PeerRank treats evaluation as a multi-agent process in which each model participates symmetrically as task designer, respondent, and evaluator, while biased judgments are filtered out. In a large-scale study over 12 commercially available models and 420 autonomously generated questions, PeerRank produces stable, discriminative rankings and reveals measurable identity and presentation biases. Rankings are robust, and mean peer scores agree with Elo ratings. We further validate PeerRank on TruthfulQA and GSM8K, where peer scores correlate with objective accuracy. Together, these results suggest that bias-aware peer evaluation with selective web-grounded answering can scale open-world LLM assessment beyond static, human-curated benchmarks.
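
As an illustration of how mean peer scores can be cross-checked against Elo, the following sketch converts per-question peer scores into pairwise wins and runs a standard Elo update. The function name `elo_from_scores`, the K-factor of 32, and the base rating of 1000 are illustrative assumptions, not values taken from the paper.

```python
# Hypothetical cross-check: treat each question as a set of pairwise matches
# decided by which model earned the higher peer score, then fit Elo ratings.
from itertools import combinations

def elo_from_scores(per_question_scores: dict[str, list[float]],
                    k: float = 32.0, base: float = 1000.0) -> dict[str, float]:
    """Derive Elo ratings from per-question dense peer scores.

    Each model maps to a list of scores, one per question; ties count 0.5.
    K-factor and base rating are illustrative defaults, not paper values.
    """
    ratings = {m: base for m in per_question_scores}
    n_questions = len(next(iter(per_question_scores.values())))
    for q in range(n_questions):
        for a, b in combinations(per_question_scores, 2):
            sa = per_question_scores[a][q]
            sb = per_question_scores[b][q]
            expected_a = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400))
            outcome_a = 1.0 if sa > sb else 0.0 if sa < sb else 0.5
            ratings[a] += k * (outcome_a - expected_a)
            ratings[b] += k * ((1.0 - outcome_a) - (1.0 - expected_a))
    return ratings

# Toy example: three models over three questions.
scores = {"A": [0.9, 0.8, 0.7], "B": [0.6, 0.7, 0.9], "C": [0.3, 0.4, 0.5]}
print(elo_from_scores(scores))
```

If the ranking induced by these Elo ratings matches the ranking by mean peer score, that agreement is evidence the peer scores are internally consistent, which is the kind of robustness check the abstract reports.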
Problem

Research questions and friction points this paper is trying to address.

LLM evaluation
human benchmarks
open-world deployment
web retrieval
bias in evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

autonomous evaluation
peer review
web-grounded reasoning
bias mitigation
multi-agent LLM assessment
Yanki Margalit
Caura.ai
Erni Avram
Caura.ai
Ran Taig
Caura.ai
Oded Margalit
Computer Science, BGU
Nurit Cohen-Inger
Computer Science and Information, Ben-Gurion University of the Negev