🤖 AI Summary
When large language models generate multiple candidate responses, conventional selection methods such as majority voting or probability-based ranking often fail to identify the best answer when the correct response differs from the superficial majority. To address this, the work proposes the Radial Consensus Score (RCS), which builds a semantic center by computing the weighted Fréchet mean of the candidate answer embeddings and ranks responses by their radial distance to that center. RCS requires no training, works with black-box models, and supports flexible weighting schemes that combine confidence and consistency signals. Evaluated on seven benchmarks spanning short-form QA and long-form reasoning tasks with five open-weight large language models, RCS variants consistently outperform strong baselines, and the gains grow as the sampling budget increases. RCS also serves as a drop-in replacement for majority voting in multi-agent debate settings.
📝 Abstract
Large language models (LLMs) frequently generate multiple candidate responses for a given prompt, yet selecting the most reliable one remains challenging, especially when correctness diverges from surface-level majority agreement. Existing approaches such as self-consistency rely on discrete voting, while probability-based methods often fail to capture relationships among candidate answers or tend to underweight high-quality but less frequent responses. These approaches also do not fully leverage the geometric structure of answer representations. To address these limitations, we introduce the Radial Consensus Score (RCS), a simple, efficient, and training-free method for best-of-N selection. RCS models semantic consensus by computing a weighted Fréchet mean (semantic center) of answer embeddings and ranking candidates by their radial distance to this center. Importantly, RCS provides a general framework that supports multiple weighting schemes, including uniform, frequency-based, and probability-based variants, enabling flexible integration of agreement signals and model confidence while remaining fully applicable in black-box settings. Extensive experiments across seven benchmarks covering short-form QA and long-form reasoning tasks, and five open-weight models, demonstrate that RCS variants consistently outperform strong baselines, with gains becoming more pronounced as the sampling budget increases. RCS also serves as an effective drop-in replacement for majority voting in multi-agent debate and exhibits strong robustness in black-box scenarios. Overall, these results highlight geometric consensus as a scalable and broadly applicable principle for reliable answer selection, extending beyond majority voting to more expressive and robust aggregation in LLM inference.
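The selection procedure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation, and several choices here are assumptions: the metric is taken to be Euclidean (in which case the weighted Fréchet mean reduces to the weighted average of the embeddings), the candidate closest to the center is treated as the best one, frequency weights are computed from exact string matches, and the function names `radial_consensus_rank` and `frequency_weights`, the toy embeddings, and the probabilities are all hypothetical.

```python
from collections import Counter

import numpy as np


def radial_consensus_rank(embeddings, weights=None):
    """Rank candidates by distance to a weighted semantic center (closest first).

    Assumption: Euclidean geometry, so the weighted Frechet mean is simply
    the weighted average of the embedding vectors.
    """
    X = np.asarray(embeddings, dtype=float)      # shape: (n_candidates, dim)
    n = X.shape[0]
    w = np.ones(n) if weights is None else np.asarray(weights, dtype=float)
    w = w / w.sum()                              # normalize weights to sum to 1
    center = w @ X                               # weighted mean, shape: (dim,)
    radial = np.linalg.norm(X - center, axis=1)  # radial distance per candidate
    order = np.argsort(radial, kind="stable")    # indices, closest first
    return order, radial, center


def frequency_weights(answers):
    """Weight each candidate by how often its exact answer string occurs."""
    counts = Counter(answers)
    return np.array([counts[a] for a in answers], dtype=float)


# Toy data: four sampled answers with made-up 2-D embeddings.
answers = ["4", "4", "5", "4"]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 0.1]]
probabilities = [0.30, 0.35, 0.10, 0.25]  # hypothetical sequence probabilities

for name, w in [
    ("uniform", None),
    ("frequency", frequency_weights(answers)),
    ("probability", probabilities),
]:
    order, radial, _ = radial_consensus_rank(embeddings, w)
    best = int(order[0])
    print(f"{name}: selected candidate {best} (answer {answers[best]!r})")
```

The three weighting schemes differ only in the weight vector passed in, which is why a single ranking function can cover the uniform, frequency-based, and probability-based variants named in the abstract. In a real setting the embeddings would come from a sentence encoder, and a non-Euclidean metric would require an iterative Fréchet mean computation instead of the closed-form average used here.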