🤖 AI Summary
Addressing the challenge of balancing performance and interpretability in multi-LLM response selection, this paper proposes LLM-PeerReview, a fully unsupervised ensemble method. It leverages cross-model evaluation ("LLM-as-a-Judge") to generate pairwise response scores and employs a graph-based truth inference algorithm to aggregate them for optimal response selection. The approach introduces the first peer-review-inspired framework driven by LLM judges, combining fully unsupervised operation with transparent, interpretable decision-making. Crucially, it integrates LLM-based judging with formal truth inference, eliminating reliance on human annotations or auxiliary meta-models. Evaluated on four standard benchmarks, LLM-PeerReview consistently outperforms state-of-the-art methods, with its two variants improving accuracy over Smoothie-Global by 6.9 and 7.3 percentage points. These results demonstrate its effectiveness, robustness, and strong generalization across diverse reasoning and factual QA tasks.
📝 Abstract
We propose LLM-PeerReview, an unsupervised LLM ensemble method that selects the best response from multiple LLM-generated candidates for each query, harnessing the collective wisdom of multiple models with diverse strengths. LLM-PeerReview is built on a novel, peer-review-inspired framework that offers a clear and interpretable mechanism while remaining fully unsupervised for flexible adaptability and generalization. Specifically, it operates in three stages: for scoring, we use the emerging LLM-as-a-Judge technique to evaluate each response by reusing the multiple LLMs at hand; for reasoning, we apply either a principled graphical-model-based truth inference algorithm or a straightforward averaging strategy to aggregate the multiple scores into a final score for each response; finally, the highest-scoring response is selected as the ensemble output. LLM-PeerReview is conceptually simple and empirically powerful. The two variants of the proposed approach obtain strong results across four datasets, outperforming the recent advanced model Smoothie-Global by 6.9 and 7.3 percentage points, respectively.
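The three-stage pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `judges` callables stand in for real LLM-as-a-Judge calls, the toy fixed scores replace actual model outputs, and the simple averaging variant is shown (the graphical-model truth inference variant would replace the aggregation step with judge-reliability-weighted inference).

```python
from statistics import mean

def peer_review_select(responses, judges):
    """Select the best candidate response via a peer-review-style ensemble.

    responses: list of candidate answer strings (one per LLM).
    judges: list of callables; judges[j](response) returns a numeric
            quality score, standing in for an LLM-as-a-Judge call.
    """
    # Stage 1 (scoring): every judge scores every candidate response.
    score_matrix = [[judge(r) for judge in judges] for r in responses]

    # Stage 2 (reasoning): aggregate each candidate's scores into one
    # final score. Simple averaging shown; the truth inference variant
    # would instead weight judges by inferred reliability.
    final_scores = [mean(row) for row in score_matrix]

    # Stage 3 (selection): the highest-scoring response is the output.
    best_index = max(range(len(responses)), key=final_scores.__getitem__)
    return responses[best_index], final_scores

# Toy example with fixed scores in place of real LLM judges.
responses = ["answer A", "answer B", "answer C"]
fixed = {"answer A": [3, 4, 3], "answer B": [5, 4, 5], "answer C": [2, 3, 2]}
judges = [lambda r, j=j: fixed[r][j] for j in range(3)]
best, scores = peer_review_select(responses, judges)
```

Because the method only reuses the LLMs already generating candidates, no labeled data or extra meta-model is needed; swapping the aggregation function is the only change between the two variants.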