🤖 AI Summary
Addressing the challenge of balancing performance and interpretability in multi-LLM response selection, this paper proposes LLM-PeerReview, a fully unsupervised ensemble method. It leverages cross-model evaluation ("LLM-as-a-Judge") to generate pairwise response scores and employs a graph-based truth inference algorithm to aggregate them for optimal response selection. The approach introduces the first peer-review-inspired framework driven by LLM judges, combining fully unsupervised operation with transparent, interpretable decision-making. Crucially, it integrates LLM-based judging with formal truth inference, eliminating reliance on human annotations or auxiliary meta-models. Evaluated on four standard benchmarks, LLM-PeerReview consistently outperforms state-of-the-art methods, with its two variants improving accuracy over Smoothie-Global by 6.9 and 7.3 percentage points. These results demonstrate its effectiveness, robustness, and strong generalization across diverse reasoning and factual QA tasks.
📝 Abstract
We propose LLM-PeerReview, an unsupervised LLM ensemble method that selects the best response from multiple LLM-generated candidates for each query, harnessing the collective wisdom of multiple models with diverse strengths. LLM-PeerReview is built on a novel, peer-review-inspired framework that offers a clear and interpretable mechanism while remaining fully unsupervised for flexible adaptability and generalization. Specifically, it operates in three stages: for scoring, we use the emerging LLM-as-a-Judge technique to evaluate each response by reusing the multiple LLMs at hand; for reasoning, we apply either a principled graphical-model-based truth inference algorithm or a straightforward averaging strategy to aggregate the multiple scores into a final score for each response; finally, the highest-scoring response is selected as the ensemble output. LLM-PeerReview is conceptually simple and empirically powerful. The two variants of the proposed approach obtain strong results across four datasets, outperforming the recent advanced model Smoothie-Global by 6.9 and 7.3 percentage points, respectively.
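The three-stage pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `judges` callables stand in for real LLM-as-a-Judge calls, the toy fixed scores replace actual model outputs, and the simple averaging variant is shown (the graphical-model truth inference variant would replace the aggregation step with judge-reliability-weighted inference).

```python
from statistics import mean

def peer_review_select(responses, judges):
    """Select the best candidate response via a peer-review-style ensemble.

    responses: list of candidate answer strings (one per LLM).
    judges: list of callables; judges[j](response) returns a numeric
            quality score, standing in for an LLM-as-a-Judge call.
    """
    # Stage 1 (scoring): every judge scores every candidate response.
    score_matrix = [[judge(r) for judge in judges] for r in responses]

    # Stage 2 (reasoning): aggregate each candidate's scores into one
    # final score. Simple averaging shown; the truth inference variant
    # would instead weight judges by inferred reliability.
    final_scores = [mean(row) for row in score_matrix]

    # Stage 3 (selection): the highest-scoring response is the output.
    best_index = max(range(len(responses)), key=final_scores.__getitem__)
    return responses[best_index], final_scores

# Toy example with fixed scores in place of real LLM judges.
responses = ["answer A", "answer B", "answer C"]
fixed = {"answer A": [3, 4, 3], "answer B": [5, 4, 5], "answer C": [2, 3, 2]}
judges = [lambda r, j=j: fixed[r][j] for j in range(3)]
best, scores = peer_review_select(responses, judges)
```

Because the method only reuses the LLMs already generating candidates, no labeled data or extra meta-model is needed; swapping the aggregation function is the only change between the two variants.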