🤖 AI Summary
Problem: In reasoning tasks, the absence of ground-truth answer labels makes it hard to reliably identify the correct reasoning chain among multiple sampled candidates. Method: This paper proposes PiCSAR, a training-free scoring function for best-of-n sampling that ranks each candidate by the joint log-likelihood of its reasoning and final answer. This score decomposes into two interpretable components, reasoning confidence and answer confidence, so candidates can be selected without labels. The method applies to generations from both large language models (LLMs) and large reasoning models (LRMs). Results: The approach achieves gains of +10.18 on MATH500 and +9.81 on AIME2025, and outperforms baselines while using at least 2x fewer samples in 16 out of 20 comparisons. Correct reasoning chains are found to have significantly higher reasoning and answer confidence than incorrect ones.
📝 Abstract
Best-of-n sampling improves the accuracy of large language models (LLMs) and large reasoning models (LRMs) by generating multiple candidate solutions and selecting the one with the highest reward. The key challenge for reasoning tasks is designing a scoring function that can identify correct reasoning chains without access to ground-truth answers. We propose Probabilistic Confidence Selection And Ranking (PiCSAR): a simple, training-free method that scores each candidate generation using the joint log-likelihood of the reasoning and final answer. This score naturally decomposes into two terms: reasoning confidence and answer confidence. PiCSAR achieves substantial gains across diverse benchmarks (+10.18 on MATH500, +9.81 on AIME2025), outperforming baselines with at least 2x fewer samples in 16 out of 20 comparisons. Our analysis reveals that correct reasoning chains exhibit significantly higher reasoning and answer confidence, which explains the effectiveness of PiCSAR.
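
The selection rule described above can be illustrated with a minimal sketch. By the chain rule, the joint log-likelihood of reasoning r and answer y given prompt x splits as log p(r, y | x) = log p(r | x) + log p(y | r, x), which is where the two confidence terms come from. The `Candidate` type, the function names, and the use of unnormalized sums of per-token log-probabilities are assumptions made for illustration; the abstract does not specify how the two terms are computed or whether they are normalized.

```python
from typing import NamedTuple, Sequence


class Candidate(NamedTuple):
    """One sampled generation (hypothetical container, not from the paper)."""
    reasoning: str
    answer: str
    reasoning_logprobs: Sequence[float]  # per-token log p(r_t | x, r_<t)
    answer_logprobs: Sequence[float]     # per-token log p(y_t | x, r, y_<t)


def reasoning_confidence(c: Candidate) -> float:
    # log p(r | x): sum of token log-probs over the reasoning chain.
    return sum(c.reasoning_logprobs)


def answer_confidence(c: Candidate) -> float:
    # log p(y | r, x): sum of token log-probs over the final answer.
    return sum(c.answer_logprobs)


def joint_score(c: Candidate) -> float:
    # Chain rule: log p(r, y | x) = log p(r | x) + log p(y | r, x).
    return reasoning_confidence(c) + answer_confidence(c)


def select_best(candidates: Sequence[Candidate]) -> Candidate:
    # Best-of-n: keep the candidate with the highest joint log-likelihood.
    # No ground-truth answer is consulted at any point.
    return max(candidates, key=joint_score)
```

In practice the per-token log-probabilities would come from the model that produced the samples. Raw sums tend to favor shorter generations, so whether and how to normalize is a design choice this sketch leaves open.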