Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process

📅 2025-12-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Addressing the challenge of balancing performance and interpretability in multi-LLM response selection, this paper proposes LLM-PeerReview, a fully unsupervised ensemble method. It leverages cross-model evaluation ("LLM-as-a-Judge") to generate per-response scores and employs a graph-based truth inference algorithm to aggregate these scores for optimal response selection. The approach introduces the first peer-review-inspired framework driven by LLM judges, achieving both fully unsupervised operation and transparent, interpretable decision-making. Crucially, it integrates LLM-based judging with formal truth inference, eliminating reliance on human annotations or auxiliary meta-models. Evaluated on four standard benchmarks, LLM-PeerReview consistently outperforms state-of-the-art methods, notably improving accuracy over Smoothie-Global by 6.9 and 7.3 percentage points for its two variants. These results demonstrate its effectiveness, robustness, and strong generalization across diverse reasoning and factual QA tasks.

📝 Abstract
We propose LLM-PeerReview, an unsupervised LLM Ensemble method that selects the best response from multiple LLM-generated candidates for each query, harnessing the collective wisdom of multiple models with diverse strengths. LLM-PeerReview is built on a novel, peer-review-inspired framework that offers a clear and interpretable mechanism, while remaining fully unsupervised for flexible adaptability and generalization. Specifically, it operates in three stages: For scoring, we use the emerging LLM-as-a-Judge technique to evaluate each response by reusing the multiple LLMs at hand; For reasoning, we apply either a principled graphical model-based truth inference algorithm or a straightforward averaging strategy to aggregate the multiple scores into a final score for each response; Finally, the highest-scoring response is selected as the best ensemble output. LLM-PeerReview is conceptually simple and empirically powerful. The two variants of the proposed approach obtain strong results across four datasets, outperforming the recent advanced model Smoothie-Global by 6.9 and 7.3 percentage points, respectively.
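The three stages described in the abstract (score with peer judges, aggregate, select) can be sketched as follows. This is a minimal illustration of the averaging variant, not the authors' implementation; `call(model, prompt)` is a hypothetical stand-in for whatever LLM client is in use, and the 1-10 rating prompt is an assumed format.

```python
def peer_review_select(query, models, call):
    """Averaging variant of a peer-review ensemble (illustrative sketch).

    `call(model, prompt) -> str` is a hypothetical LLM client supplied by
    the caller; the rating prompt below is an assumed format.
    """
    # Each model drafts a candidate response to the query.
    candidates = [call(m, query) for m in models]

    # Scoring: every model judges every candidate (LLM-as-a-Judge).
    scores = []
    for judge in models:
        row = []
        for cand in candidates:
            prompt = ("Rate the following answer on a 1-10 scale. "
                      "Reply with a single number.\n"
                      f"Question: {query}\nAnswer: {cand}")
            row.append(float(call(judge, prompt)))
        scores.append(row)

    # Reasoning (averaging variant): mean judge score per candidate.
    avg = [sum(col) / len(models) for col in zip(*scores)]

    # Selection: the highest-scoring candidate is the ensemble output.
    best = max(range(len(candidates)), key=avg.__getitem__)
    return candidates[best]
```

Because the judges are the same models that produced the candidates, no human labels or auxiliary meta-model are needed at any stage.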
Problem

Research questions and friction points this paper is trying to address.

How to select the best response from multiple LLM-generated candidates for a given query
How to ensemble LLMs fully unsupervised while keeping the selection interpretable
How to aggregate noisy cross-model judgments into a reliable final score per response
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fully unsupervised ensemble method that selects the best LLM response
Peer-review-inspired framework with scoring, reasoning, and selection stages
Combines LLM-as-a-Judge scoring with graphical model-based truth inference for aggregation
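The graphical-model truth inference is not specified on this page; as an illustration of the general idea, the following CRH-style heuristic (an assumption, not the paper's algorithm) alternates between a reliability-weighted consensus score per candidate and reliability weights per judge, so judges who agree with the consensus count more.

```python
import math

def truth_infer(score_matrix, iters=20):
    """Reliability-weighted score aggregation (CRH-style sketch).

    score_matrix[j][c] is judge j's score for candidate c. This heuristic
    stands in for the paper's graphical model-based truth inference,
    whose exact form is not given here.
    """
    n_judges = len(score_matrix)
    n_cands = len(score_matrix[0])
    weights = [1.0 / n_judges] * n_judges  # start with uniform trust

    for _ in range(iters):
        # Consensus: weight each judge's scores by its current reliability.
        consensus = [sum(w * row[c] for w, row in zip(weights, score_matrix))
                     for c in range(n_cands)]
        # Reliability: judges with smaller squared error from the consensus
        # receive larger (-log of relative error) weights.
        errs = [sum((row[c] - consensus[c]) ** 2 for c in range(n_cands)) + 1e-9
                for row in score_matrix]
        total = sum(errs)
        weights = [-math.log(e / total) for e in errs]
        norm = sum(weights)
        weights = [w / norm for w in weights]

    return consensus, weights
```

For example, with scores `[[9, 3], [8, 4], [2, 9]]`, the two agreeing judges dominate, the dissenting third judge is down-weighted, and candidate 0 wins.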
Zhijun Chen
Beihang University
Machine Learning · Natural Language Processing
Zeyu Ji
Beihang University, Beijing, China
Qianren Mao
Zhongguancun Laboratory
Text Mining · Text Generation · Knowledge Graph and Reasoning
Hao Wu
Xi’an Jiaotong University, Xi’an, China
Junhang Cheng
Beihang University, Beijing, China
Bangjie Qin
Hong Kong University of Science and Technology, Hong Kong, China
Zhuoran Li
Beihang University, Beijing, China
Jingzheng Li
Zhongguancun Laboratory, Beijing, China
Kai Sun
Xi’an Jiaotong University, Xi’an, China
Zizhe Wang
Tsinghua University, Beijing, China
Yikun Ban
Beihang University, University of Illinois Urbana-Champaign
Reinforcement Learning · Ensemble Learning
Zhu Sun
Singapore University of Technology and Design, Singapore
Xiangyang Ji
Tsinghua University, Beijing, China
Hailong Sun
Professor of Computer Science, Beihang University
Software Engineering · Artificial Intelligence · Software Systems