AutoBench: Automating LLM Evaluation through Reciprocal Peer Assessment

📅 2025-10-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address severe test-set contamination, the poor adaptability of static benchmarks, and the limited capacity for dynamic task generation in LLM evaluation, this paper proposes a fully automated, contamination-resistant, distributed evaluation framework. Methodologically, it introduces a multi-agent mutual-evaluation mechanism in which models alternately assume the roles of "task generator" and "evaluator"; task generation and closed-loop assessment are enabled via cyclic weighting, consensus-based aggregation, and iterative reliability calibration. Crucially, the framework eliminates reliance on fixed test sets and enhances robustness and human alignment through collaborative judgment by multiple evaluators. Empirical results show Pearson correlations of 78% with MMLU-Pro and 63% with GPQA, substantially outperforming single-evaluator baselines and demonstrating both effectiveness and strong generalization across diverse reasoning-intensive benchmarks.

📝 Abstract
We present AutoBench, a fully automated and self-sustaining framework for evaluating Large Language Models (LLMs) through reciprocal peer assessment. This paper provides a rigorous scientific validation of the AutoBench methodology, originally developed as an open-source project by eZecute S.R.L. Unlike static benchmarks that suffer from test-set contamination and limited adaptability, AutoBench dynamically generates novel evaluation tasks while models alternately serve as question generators, contestants, and judges across diverse domains. An iterative weighting mechanism amplifies the influence of consistently reliable evaluators, aggregating peer judgments into consensus-based rankings that reflect collective model agreement. Our experiments demonstrate strong correlations with established benchmarks, including MMLU-Pro and GPQA (78% and 63%, respectively), validating this peer-driven evaluation paradigm. The multi-judge design significantly outperforms single-judge baselines, confirming that distributed evaluation produces more robust and human-consistent assessments. AutoBench offers a scalable, contamination-resistant alternative to static benchmarks for the continuous evaluation of evolving language models.
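The iterative weighting mechanism described in the abstract can be illustrated with a minimal sketch. The function name, the inverse-deviation update rule, and the fixed iteration count below are illustrative assumptions, not the paper's exact algorithm: judges that stay close to the emerging consensus gain weight, and the consensus ranking is recomputed under the new weights.

```python
# Hypothetical sketch of iterative, reliability-weighted consensus scoring.
# scores[j][m] is the grade judge j assigns to contestant model m.
# The update rule below (weight ~ inverse mean deviation from consensus)
# is an illustrative assumption, not AutoBench's exact formula.

def consensus_rankings(scores, iterations=10):
    n_judges = len(scores)
    n_models = len(scores[0])
    weights = [1.0 / n_judges] * n_judges  # start with uniform judge weights

    for _ in range(iterations):
        # Weighted consensus score for each contestant model.
        consensus = [
            sum(weights[j] * scores[j][m] for j in range(n_judges))
            for m in range(n_models)
        ]
        # A judge's reliability shrinks with its mean absolute
        # deviation from the current consensus.
        deviations = [
            sum(abs(scores[j][m] - consensus[m]) for m in range(n_models)) / n_models
            for j in range(n_judges)
        ]
        raw = [1.0 / (1e-6 + d) for d in deviations]
        total = sum(raw)
        weights = [r / total for r in raw]

    ranking = sorted(range(n_models), key=lambda m: -consensus[m])
    return consensus, weights, ranking
```

With three judges where one is an outlier, repeated iterations shift weight toward the two judges that agree, so the outlier's influence on the final ranking fades.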
Problem

Research questions and friction points this paper is trying to address.

Automating LLM evaluation via reciprocal peer assessment
Addressing test-set contamination in static benchmarks
Generating dynamic tasks for robust model comparison
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automates LLM evaluation via reciprocal peer assessment
Dynamically generates tasks with rotating model roles
Uses iterative weighting for consensus-based model rankings
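The role rotation named in the bullets above (question generator, contestant, judge) can be sketched as a round-robin schedule. The function and the specific scheduling policy are assumptions for illustration; the paper only states that models alternate roles across rounds.

```python
# Hypothetical sketch of rotating model roles across evaluation rounds.
# Role names follow the paper's description; the round-robin scheduling
# policy itself is an illustrative assumption.

def assign_roles(models, round_idx):
    """One model generates the round's question; the remaining models
    answer as contestants and then peer-judge each other's answers."""
    n = len(models)
    generator = models[round_idx % n]
    contestants = [m for m in models if m != generator]
    judges = contestants  # every contestant also serves as a judge
    return generator, contestants, judges
```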
Dario Loi
Department of Computer Science, Sapienza University of Rome
Elena Maria Muià
Department of Computer, Control, and Management Engineering, Sapienza University of Rome
Federico Siciliano
Post-doc, Sapienza University of Rome
Explainable Artificial Intelligence, Recommender Systems
Giovanni Trappolini
Department of Computer, Control, and Management Engineering, Sapienza University of Rome
Vincenzo Crisà
Department of Computer, Control, and Management Engineering, Sapienza University of Rome
Peter Kruger
eZecute S.R.L.
Fabrizio Silvestri
Sapienza University of Rome
Machine Learning, Artificial Intelligence, Natural Language Processing, RAG, Web