Reasoning or Fluency? Dissecting Probabilistic Confidence in Best-of-N Selection

📅 2026-01-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates whether the probability-based confidence scores commonly used for Best-of-N selection genuinely reflect reasoning quality or merely capture surface-level fluency. To probe this, the authors systematically test the sensitivity of such confidence measures to logical structure, introducing three perturbation methods that disrupt causal dependencies among reasoning steps while preserving local fluency. Building on these findings, they propose a contrastive causality metric that explicitly models inter-step causal relationships. Experiments show that conventional probabilistic confidence is largely insensitive to causal disruptions, whereas the proposed metric consistently outperforms existing approaches across multiple models and reasoning benchmarks. The study thus exposes a critical limitation of current confidence estimation and offers a more faithful way to assess reasoning quality.
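The paper itself ships no code here, so the sketch below is only an illustrative reconstruction of the setup being critiqued: it assumes the common mean token log-probability confidence score and uses GPT-2 (via Hugging Face transformers) purely as a stand-in scorer. The `shuffle_steps` helper is a hypothetical example of a fluency-preserving causal perturbation in the spirit of the three methods described above, not the authors' implementation.

```python
import random

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()


def mean_logprob(text: str) -> float:
    """Mean per-token log-probability: the usual probabilistic confidence proxy."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Position t predicts token t+1, so shift logits and targets by one.
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = logprobs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp.mean().item()


def best_of_n(candidates: list[str]) -> str:
    """Best-of-N selection: return the candidate the scorer is most confident in."""
    return max(candidates, key=mean_logprob)


def shuffle_steps(steps: list[str], seed: int = 0) -> str:
    """A toy causal perturbation: permute step order, leaving each step fluent."""
    perturbed = steps[:]
    random.Random(seed).shuffle(perturbed)
    return " ".join(perturbed)
```

If probabilistic confidence actually tracked reasoning structure, `mean_logprob(shuffle_steps(steps))` should drop sharply relative to the unperturbed chain; the paper's central finding is that it largely does not.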

📝 Abstract
Probabilistic confidence metrics are increasingly adopted as proxies for reasoning quality in Best-of-N selection, under the assumption that higher confidence reflects higher reasoning fidelity. In this work, we challenge this assumption by investigating whether these metrics truly capture inter-step causal dependencies necessary for valid reasoning. We introduce three classes of inter-step causality perturbations that systematically disrupt dependencies between reasoning steps while preserving local fluency. Surprisingly, across diverse model families and reasoning benchmarks, we find that selection accuracy degrades only marginally under these disruptions. Even severe interventions, such as applying hard attention masks that directly prevent the model from attending to prior reasoning steps, do not substantially reduce selection performance. These findings provide strong evidence that current probabilistic metrics are largely insensitive to logical structure, and primarily capture surface-level fluency or in-distribution priors instead. Motivated by this gap, we propose a contrastive causality metric that explicitly isolates inter-step causal dependencies, and demonstrate that it yields more faithful output selection than existing probability-based approaches.
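The abstract names the proposed metric but does not give its formulation, so the following is a minimal sketch of one plausible contrastive form, assuming the score contrasts each step's likelihood with and without its prior steps in context. The function names (`conditional_logprob`, `contrastive_causality`) and the averaging scheme are hypothetical, not taken from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()


def conditional_logprob(step: str, context: str) -> float:
    """Log-probability of `step`'s tokens given a non-empty `context`.

    Assumes tokenizing `context` yields a prefix of tokenizing `context + step`,
    which holds for GPT-2's BPE when `step` begins with a space.
    """
    n_ctx = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    ids = tokenizer(context + step, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = logprobs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    # Keep only the positions that predict the step's own tokens.
    return token_lp[:, n_ctx - 1:].sum().item()


def contrastive_causality(question: str, steps: list[str]) -> float:
    """Average lift in each step's log-probability contributed by its prior steps."""
    lift = 0.0
    for i, step in enumerate(steps):
        ctx_with = " ".join([question] + steps[:i])
        lift += (conditional_logprob(" " + step, ctx_with)
                 - conditional_logprob(" " + step, question))
    return lift / max(len(steps), 1)
```

Under this formulation, a chain whose steps genuinely depend on one another earns a large lift, while a shuffled or structurally disrupted chain does not, which is exactly the property plain probabilistic confidence fails to test.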
Problem

Research questions and friction points this paper is trying to address.

probabilistic confidence
reasoning fidelity
inter-step causality
Best-of-N selection
logical structure
Innovation

Methods, ideas, or system contributions that make the work stand out.

probabilistic confidence
causal dependencies
Best-of-N selection
contrastive causality metric
reasoning fluency