PiCSAR: Probabilistic Confidence Selection And Ranking

📅 2025-08-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: In reasoning tasks, the absence of ground-truth answer labels makes it hard to reliably identify correct reasoning chains among multiple candidates. Method: The paper proposes a training-free scoring function for best-of-n sampling that ranks each candidate by the joint log-likelihood of its reasoning and final answer. This score decomposes into two interpretable components, reasoning confidence and answer confidence, and applies to outputs of both large language models (LLMs) and large reasoning models (LRMs). Results: The approach achieves accuracy gains of +10.18 on MATH500 and +9.81 on AIME2025, and outperforms baselines while using at least 2x fewer samples in 16 out of 20 comparisons.

📝 Abstract
Best-of-n sampling improves the accuracy of large language models (LLMs) and large reasoning models (LRMs) by generating multiple candidate solutions and selecting the one with the highest reward. The key challenge for reasoning tasks is designing a scoring function that can identify correct reasoning chains without access to ground-truth answers. We propose Probabilistic Confidence Selection And Ranking (PiCSAR): a simple, training-free method that scores each candidate generation using the joint log-likelihood of the reasoning and final answer. This joint log-likelihood naturally decomposes into reasoning confidence and answer confidence. PiCSAR achieves substantial gains across diverse benchmarks (+10.18 on MATH500, +9.81 on AIME2025), outperforming baselines with at least 2x fewer samples in 16 out of 20 comparisons. Our analysis reveals that correct reasoning chains exhibit significantly higher reasoning and answer confidence, justifying the effectiveness of PiCSAR.
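
The selection step described in the abstract can be illustrated with a minimal sketch. It assumes that per-token log-probabilities for the reasoning and the final answer are already available for each sampled candidate; the names `Candidate`, `joint_score`, and `select_best` are hypothetical and not taken from the paper, and any normalization or other details the authors may use are not modeled here.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Candidate:
    """One sampled generation (hypothetical container, not from the paper)."""
    answer: str
    reasoning_logprobs: Sequence[float]  # per-token log-probs of the reasoning
    answer_logprobs: Sequence[float]     # per-token log-probs of the final answer


def reasoning_confidence(c: Candidate) -> float:
    # Log-likelihood of the reasoning tokens: sum of their log-probs.
    return sum(c.reasoning_logprobs)


def answer_confidence(c: Candidate) -> float:
    # Log-likelihood of the answer tokens, conditioned on the reasoning.
    return sum(c.answer_logprobs)


def joint_score(c: Candidate) -> float:
    # Joint log-likelihood = reasoning confidence + answer confidence.
    return reasoning_confidence(c) + answer_confidence(c)


def select_best(candidates: Sequence[Candidate]) -> Candidate:
    # Best-of-n selection: return the candidate with the highest joint score.
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=joint_score)
```

For example, given one candidate whose token log-probabilities are close to zero and another whose log-probabilities are strongly negative, `select_best` returns the former, since its joint log-likelihood is higher. No reward model or ground-truth label is involved in this step.
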
Problem

Research questions and friction points this paper is trying to address.

Scoring candidate solutions without ground-truth answers
Identifying correct reasoning chains in LLMs
Improving accuracy with fewer generated samples
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scores each candidate by the joint log-likelihood of its reasoning and final answer
Requires no training or ground-truth labels to select among candidates
Decomposes the score into reasoning confidence and answer confidence
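
The decomposition named above follows from the chain rule of probability. As a sketch, with symbols that are assumptions rather than the paper's notation (x is the prompt, r the reasoning chain, y the final answer, and i indexes the n sampled candidates):

```latex
\log p(r, y \mid x)
  = \underbrace{\log p(r \mid x)}_{\text{reasoning confidence}}
  + \underbrace{\log p(y \mid r, x)}_{\text{answer confidence}},
\qquad
\hat{\imath} = \arg\max_{i \in \{1, \dots, n\}} \log p(r_i, y_i \mid x)
```

Under this reading, the first term measures how likely the model finds its own reasoning, and the second how likely it finds the answer given that reasoning. The paper may define or normalize these terms differently.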