Auto-PRE: An Automatic and Cost-Efficient Peer-Review Framework for Language Generation Evaluation

šŸ“… 2024-10-16
šŸ“ˆ Citations: 0
✨ Influential: 0
šŸ¤– AI Summary
To address bottlenecks in LLM evaluation, including high human annotation costs, rigid task formats, reliance on reference answers, and systematic biases, this paper proposes Auto-PRE, an automated peer-review evaluation framework. Methodologically, Auto-PRE introduces an automatic evaluator-LLM selection mechanism grounded in three traits: consistency, pertinence, and self-confidence, which correspond to the instruction, content, and response stages of the evaluation process. It combines the LLM-as-judge paradigm with multi-dimensional capability quantification, structured qualification exams, and task-adaptive evaluator selection. Empirically, Auto-PRE achieves state-of-the-art performance on three diverse tasks (summarization, non-factoid QA, and dialogue generation) while substantially reducing evaluation cost, and its scalable, cross-task design enables robust, reference-free, and format-agnostic LLM assessment.

šŸ“ Abstract
The rapid development of large language models (LLMs) has highlighted the need for efficient and reliable methods to evaluate their performance. Traditional evaluation methods often face challenges like high costs, limited task formats, dependence on human references, and systematic biases. To address these limitations, we propose Auto-PRE, an automatic LLM evaluation framework inspired by the peer review process. Unlike previous approaches that rely on human annotations, Auto-PRE automatically selects evaluator LLMs based on three core traits: consistency, pertinence, and self-confidence, which correspond to the instruction, content, and response stages, respectively, and collectively cover the entire evaluation process. Experiments on three representative tasks, including summarization, non-factoid QA, and dialogue generation, demonstrate that Auto-PRE achieves state-of-the-art performance while significantly reducing evaluation costs. Furthermore, the structured and scalable design of our automatic qualification exam framework provides valuable insights into automating the evaluation of LLMs-as-judges, paving the way for more advanced LLM-based evaluation frameworks.
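The abstract describes selecting evaluator LLMs via a qualification exam over three traits (consistency, pertinence, self-confidence). The paper does not publish this exact algorithm; the sketch below is only an illustration of the selection step under the assumption that each trait is quantified as a score in [0, 1] and a candidate must clear a threshold on every trait. All names and the threshold value are hypothetical.

```python
# Illustrative sketch only; not the paper's published implementation.
# Trait names follow the abstract; scoring scale and threshold are assumptions.
from dataclasses import dataclass


@dataclass
class TraitScores:
    consistency: float      # instruction stage: stable judgments across paraphrased prompts
    pertinence: float       # content stage: judgments grounded in the evaluated text
    self_confidence: float  # response stage: ability to discriminate between responses


def qualifies(scores: TraitScores, threshold: float = 0.7) -> bool:
    """A candidate passes the qualification exam if every trait clears the bar."""
    return min(scores.consistency, scores.pertinence, scores.self_confidence) >= threshold


def select_evaluators(candidates: dict[str, TraitScores], threshold: float = 0.7) -> list[str]:
    """Keep only candidate LLMs whose exam scores qualify them as peer reviewers."""
    return [name for name, s in candidates.items() if qualifies(s, threshold)]


candidates = {
    "model_a": TraitScores(0.90, 0.80, 0.85),
    "model_b": TraitScores(0.60, 0.90, 0.90),  # fails on consistency
}
print(select_evaluators(candidates))  # ['model_a']
```

Once selected, the qualified models would act as peer reviewers whose judgments are aggregated, replacing the human-annotated evaluator selection of earlier peer-review approaches.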
Problem

Research questions and friction points this paper is trying to address.

Automates LLM evaluation to reduce human annotation costs
Addresses systematic biases in traditional language generation assessment
Selects optimal evaluator models through multi-stage qualification traits
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automatically selects evaluator LLMs for peer review
Uses consistency, pertinence, and self-confidence traits
Achieves state-of-the-art performance with reduced costs
Junjie Chen
Department of Computer Science and Technology, Tsinghua University
Weihang Su
Tsinghua University
Information Retrieval, Natural Language Processing, AI for Legal
Zhumin Chu
PhD student, Tsinghua University
information retrieval, user study, evaluation
Haitao Li
Department of Computer Science and Technology, Tsinghua University
Yujia Zhou
Department of Computer Science and Technology, Tsinghua University
Dingbo Yuan
Ant Group
Xudong Wang
Ant Group
Jun Zhou
Ant Group
Yiqun Liu
Department of Computer Science and Technology, Tsinghua University
Min Zhang
Department of Computer Science and Technology, Tsinghua University
Shaoping Ma
Department of Computer Science and Technology, Tsinghua University
Qingyao Ai
Associate Professor, Dept. of CS&T, Tsinghua University
Information Retrieval, Machine Learning