Uncertainty-Aware Answer Selection for Improved Reasoning in Multi-LLM Systems

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the inefficiency of answer selection in multi-LLM systems, which commonly relies on external verifiers, human evaluation, or repeated sampling, this paper proposes a lightweight uncertainty-aware answer selection method. It leverages calibrated log-likelihood scores derived from each model's output to perform implicit uncertainty modeling, eliminating the need for auxiliary verifiers or redundant sampling. The approach uniformly supports both debate-based and non-debate reasoning paradigms. Empirical evaluation shows consistent improvements of approximately 4%, 3%, and 5% on GSM8K, MMLU, and ARC, respectively, outperforming self-consistency and state-of-the-art multi-model selection baselines. The core contribution is the first systematic use of calibrated intra-model likelihood scores as a reliability metric that is comparable across models, enabling efficient, robust answer selection under resource constraints.
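The selection idea described above can be sketched in a few lines: score each candidate response by its own model's (calibrated) log-likelihood and return the highest-scoring answer. The sketch below is a minimal illustration under assumptions, not the paper's exact formulation: `Candidate`, `calibrated_score`, and `select_answer` are hypothetical names, and length normalization stands in for whatever calibration the authors actually use to make scores comparable across models.

```python
# Hypothetical sketch of uncertainty-aware answer selection.
# Assumption: each model exposes per-token log-probabilities for its response,
# and length-normalizing them yields a score comparable across models.
from dataclasses import dataclass


@dataclass
class Candidate:
    answer: str                  # final answer extracted from one model's output
    token_logprobs: list[float]  # per-token log-probabilities of that response


def calibrated_score(c: Candidate) -> float:
    """Average per-token log-probability (length-normalized log-likelihood).

    Normalizing by response length is one simple calibration that keeps
    longer responses from being penalized purely for having more tokens.
    """
    return sum(c.token_logprobs) / max(len(c.token_logprobs), 1)


def select_answer(candidates: list[Candidate]) -> str:
    """Return the answer whose candidate has the highest calibrated score."""
    return max(candidates, key=calibrated_score).answer
```

Because no auxiliary verifier or extra sampling is involved, this selection step costs a single pass over log-probabilities the models already produce, which is the efficiency argument the paper makes against external-verifier and self-consistency baselines.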

📝 Abstract
Large Language Models (LLMs) have demonstrated exceptional capabilities, yet selecting the most reliable response from multiple LLMs remains a challenge, particularly in resource-constrained settings. Existing approaches often depend on costly external verifiers, human evaluators, or self-consistency techniques that require multiple samples from a single model. While multi-LLM systems produce more diverse responses than single models and thus have greater potential, they often underperform compared to single-LLM self-consistency. We propose a principled, novel, and computationally efficient method to select the best response from multiple different LLMs using a calibrated log-likelihood score, implicitly leveraging the inherent knowledge and confidence of these models. Our method demonstrates improvements of approximately 4%, 3%, and 5% across both debate (multi-round LLM discussions) and non-debate (Best-of-N with multiple LLMs) settings on the GSM8K, MMLU (6 subsets), and ARC datasets, respectively.
Problem

Research questions and friction points this paper is trying to address.

Selecting reliable responses from multiple LLMs
Overcoming limitations of costly external verification methods
Improving reasoning accuracy in resource-constrained multi-LLM systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uncertainty-aware answer selection for multi-LLM systems
Computationally efficient calibrated log-likelihood scoring method
Leverages inherent model knowledge without external verifiers