Measuring Black-Box Confidence via Reasoning Trajectories: Geometry, Coverage, and Verbalization

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing confidence estimation methods for black-box chain-of-thought (CoT) reasoning are expensive, since the cost of self-consistency grows linearly with the number of samples K, and they ignore the geometric structure of reasoning trajectories. This work embeds each CoT as a sliding-window trajectory and fuses three channels (geometric convergence, coverage, and verbalized confidence) without access to internal model states or supervised calibrators. Across six benchmark-reasoner combinations, the fused score at K=4 outperforms self-consistency at K=8 in all six, with a median AUC improvement of 0.075. Fusion beats the best single channel in 17 of 18 experimental settings and reaches a peak AUC of 0.92.
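To make the idea of calibrator-free fusion concrete, here is a minimal sketch in Python. The fusion rule is an assumption: the summary does not say how the three channels are combined, so this example simply rank-normalizes each channel and averages them, which needs no labels. The function names (`rank_normalize`, `fuse`, `auc`) and the toy scores are hypothetical and are not taken from the paper.

```python
def rank_normalize(scores):
    """Map raw scores to mid-ranks in (0, 1] so channels on different scales are comparable."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied scores.
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mid = (i + j) / 2 + 1  # average 1-based rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = mid / n
        i = j + 1
    return ranks


def fuse(channels):
    """Assumed unsupervised fusion: mean of rank-normalized channel scores per question."""
    normed = [rank_normalize(c) for c in channels]
    return [sum(col) / len(col) for col in zip(*normed)]


def auc(scores, labels):
    """AUC as the probability that a correct item outscores an incorrect one (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one correct and one incorrect item")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy, made-up scores for four questions (not results from the paper).
geometry = [0.90, 0.40, 0.70, 0.20]
coverage = [0.80, 0.60, 0.30, 0.10]
verbalized = [0.95, 0.90, 0.85, 0.90]
correct = [1, 0, 1, 0]  # 1 = the reasoner's answer was right

fused = fuse([geometry, coverage, verbalized])
print("fused AUC:", auc(fused, correct))
```

Rank averaging is only one label-free option; any monotone rescaling followed by a fixed combination would fit the same description.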

📝 Abstract

Reliable confidence estimation enables safe deployment of chain-of-thought (CoT) reasoning through text-only APIs. Yet the dominant black-box baseline, self-consistency over K samples, is linearly expensive and ignores the geometry of the trace. We propose a black-box trajectory-confidence score: we embed a CoT as a sliding-window trajectory and measure its convergence to external answer anchors with a one-parameter softmax. The method needs no logits, hidden states, or supervised calibrators. Across six (benchmark, reasoner) settings on MedQA-USMLE, GPQA Diamond, and MMLU-Pro with Gemini 3.1 Pro and Claude Sonnet 4.6, fusing this score with coverage and verbalized-confidence channels at K=4 yields Pareto improvements over self-consistency at K=8 in 6/6 settings (median AUC 0.78 vs 0.71, ΔAUC = +0.075). A fixed-pick control (+0.060) and E5 cross-embedder replication rule out answer switching and single-vendor artifacts. Geometry peaks in the penultimate window across benchmarks and reasoners, and inverts at the terminal window on GPQA Diamond. Three unscaffolded regimes separate black-box confidence into a judge-mediated Coverage prior (C), within-trace Geometry (G), and a conditional Verbalization channel (V). Across 18 benchmark × reasoner × proposer settings, C and G provide independent signal in 18/18 and 16/18, while V contributes residual signal in 6/18. Swapping the judge from GPT-5-mini to Claude Sonnet 4.6 leaves G-only AUC unchanged (|Δ| ≤ 0.013) and shifts C-only AUC by at most ±0.02 (κ = 0.82). Fusion beats the best single channel in 17/18 settings (median AUC 0.78, max 0.92).
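The trajectory-confidence idea described in the abstract can be sketched roughly as follows. This is an illustrative reading, not the authors' implementation: it assumes windows are built over sentences, that similarity to each answer anchor is cosine similarity, and that the single softmax parameter is a temperature `tau`. The window size, the stride, and all function names are hypothetical, and the embedding step is left to the caller (any text embedder that returns a vector would do).

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def sliding_windows(sentences, size=3, stride=1):
    """Split a CoT (list of sentences) into overlapping text windows."""
    if len(sentences) <= size:
        return [" ".join(sentences)]
    return [" ".join(sentences[i:i + size])
            for i in range(0, len(sentences) - size + 1, stride)]


def anchor_softmax(window_vec, anchor_vecs, tau=0.1):
    """Softmax over similarities to each answer anchor; tau is the one free parameter (assumed)."""
    logits = [cosine(window_vec, a) / tau for a in anchor_vecs]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def trajectory_confidence(window_vecs, anchor_vecs, picked, tau=0.1):
    """Per-window softmax mass on the picked answer, in trace order."""
    return [anchor_softmax(w, anchor_vecs, tau)[picked] for w in window_vecs]


# Toy 2-D "embeddings" (made up): the trace drifts toward anchor 0.
anchors = [[1.0, 0.0], [0.0, 1.0]]
trace = [[1.0, 1.0], [2.0, 1.0], [5.0, 1.0]]
scores = trajectory_confidence(trace, anchors, picked=0)
print([round(s, 3) for s in scores])
```

Which window to read the score from is left open here. The abstract reports that the geometry signal peaks at the penultimate window, so a caller might use `scores[-2]` when the trace has at least two windows.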

Problem

Research questions and friction points this paper is trying to address.

confidence estimation

chain-of-thought reasoning

black-box models

reasoning trajectories

safe deployment

Innovation

Methods, ideas, or system contributions that make the work stand out.

trajectory-confidence

black-box confidence

reasoning geometry