PiCO: Peer Review in LLMs based on the Consistency Optimization

📅 2024-02-02
📈 Citations: 5
Influential: 3
📄 PDF
🤖 AI Summary
Current LLM evaluation relies heavily on human annotations and closed benchmarks and lacks a scalable, unsupervised paradigm. Method: We propose a peer-review-based unsupervised automatic evaluation framework: open- and closed-source models anonymously answer unlabeled questions without ground-truth labels and reciprocally score each other's responses; each model's learnable capability parameter is jointly optimized under a consistency objective to induce a capability hierarchy. Contribution/Results: We introduce three metrics (PEN, CIN, and LIS) to quantify how well the induced model rankings align with human judgments. Extensive experiments across multiple datasets show that our method significantly outperforms baselines, with the resulting rankings correlating strongly with human evaluations, thereby addressing key limitations of conventional LLM assessment.

📝 Abstract
Existing large language model (LLM) evaluation methods typically test performance on closed, domain-specific benchmarks with human annotations. In this paper, we explore a novel unsupervised evaluation direction, using peer-review mechanisms to measure LLMs automatically. In this setting, both open-source and closed-source LLMs share the same environment, answering unlabeled questions and evaluating each other, so that each LLM's response score is jointly determined by the other, anonymous models. To obtain the ability hierarchy among these models, we assign each LLM a learnable capability parameter that adjusts the final ranking. We formalize this as a constrained optimization problem, aiming to maximize the consistency between each LLM's capability and its scores. The key assumption is that a high-level LLM can evaluate others' answers more accurately than a low-level one, while a higher-level LLM also achieves higher response scores. Moreover, we propose three metrics, called PEN, CIN, and LIS, to measure the gap from human rankings. We perform experiments on multiple datasets with these metrics, validating the effectiveness of the proposed approach.
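The consistency optimization described in the abstract can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the score matrix `S`, the weight vector `w`, the quadratic objective, and the projected-gradient loop are all assumptions made for the sake of the sketch.

```python
import numpy as np

# Illustrative sketch (not the paper's exact formulation): model j reviews
# each answer from model i, giving a score matrix S[i, j]. Each model also
# gets a learnable capability weight w[j]. A model's final score is the
# w-weighted average of the reviews it receives; "consistency" here means
# capable models should both score highly and have their reviews count more.

rng = np.random.default_rng(0)
n_models = 5
S = rng.uniform(0, 1, size=(n_models, n_models))  # S[i, j]: model j's score for model i

w = np.ones(n_models) / n_models  # learnable capability weights on the simplex
lr = 0.5
for _ in range(200):
    grad = (S + S.T) @ w          # gradient of the consistency objective w^T S w
    w = w + lr * grad             # gradient ascent step
    w = np.clip(w, 1e-6, None)
    w = w / w.sum()               # project back onto the probability simplex

ranking = np.argsort(-(S @ w))    # induced capability hierarchy (best model first)
```

The simplex constraint keeps the weights interpretable as relative capabilities; the fixed point rewards models whose high scores come from highly weighted reviewers.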
Problem

Research questions and friction points this paper is trying to address.

Unsupervised evaluation of LLMs
Peer-review mechanism for model ranking
Optimization of consistency in LLM capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unsupervised peer-review mechanisms
Learnable capability parameter optimization
Three metrics for ranking alignment
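As an illustration of ranking-alignment measures of the kind the third point refers to (the paper's exact definitions of PEN, CIN, and LIS may differ), two classic ways to score how far a predicted ranking sits from a human reference are counting inversions and taking the length of the longest increasing subsequence:

```python
from bisect import bisect_left

def count_inversions(perm):
    """Number of out-of-order pairs; 0 means perfect agreement
    with the reference order 0, 1, ..., n-1."""
    inv = 0
    for i in range(len(perm)):
        for j in range(i + 1, len(perm)):
            if perm[i] > perm[j]:
                inv += 1
    return inv

def lis_length(perm):
    """Length of the longest increasing subsequence; n means the
    predicted ranking matches the reference exactly."""
    tails = []  # tails[k]: smallest tail of an increasing subsequence of length k+1
    for x in perm:
        k = bisect_left(tails, x)
        if k == len(tails):
            tails.append(x)
        else:
            tails[k] = x
    return len(tails)

# A predicted ranking expressed as positions in the human reference order:
pred = [0, 2, 1, 4, 3]
print(count_inversions(pred))  # 2: two adjacent swaps away from the reference
print(lis_length(pred))        # 3: three of five models already in order
```

Fewer inversions and a longer increasing subsequence both indicate closer alignment with the human ranking.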
Kun-Peng Ning
Peking University
Machine Learning · LLMs
Shuo Yang
School of Future Technology, Tianjin University
Yu-Yang Liu
School of Electrical and Computer Engineering, Peking University
Jia-Yu Yao
School of Electrical and Computer Engineering, Peking University
Zhen-Hui Liu
School of Electrical and Computer Engineering, Peking University
Yong-Hong Tian
Yibing Song
Deputy Chief Engineer, BYD Group
Multi-Modal AI
Li Yuan
Research Associate, University of Science & Technology of China (USTC)
Antibiotic resistance · Wastewater treatment · Environmental bioremediation · Anaerobic digestion · Fate of organic pollutants