How Reliable are Confidence Estimators for Large Reasoning Models? A Systematic Benchmark on High-Stakes Domains

📅 2026-01-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the unreliable confidence estimates of large reasoning models in high-stakes domains such as clinical, financial, legal, and mathematical applications. To this end, the authors introduce RMCB, the first comprehensive public benchmark for confidence estimation in large reasoning models, comprising 347,496 reasoning trajectories. They systematically evaluate over ten representation-based methods built on hidden states—including sequence-based, graph-structured, and text encoders—and uncover a fundamental trade-off between discriminative ability and calibration performance. Notably, increased architectural complexity does not consistently yield gains: the best AUROC (0.672) is achieved by a text encoder, while the lowest expected calibration error (ECE = 0.148) comes from a structure-aware model. No single method dominates both metrics, revealing a performance bottleneck in current paradigms.

📝 Abstract
The miscalibration of Large Reasoning Models (LRMs) undermines their reliability in high-stakes domains, necessitating methods to accurately estimate the confidence of their long-form, multi-step outputs. To address this gap, we introduce the Reasoning Model Confidence estimation Benchmark (RMCB), a public resource of 347,496 reasoning traces from six popular LRMs across different architectural families. The benchmark is constructed from a diverse suite of datasets spanning high-stakes domains, including clinical, financial, legal, and mathematical reasoning, alongside complex general reasoning benchmarks, with correctness annotations provided for all samples. Using RMCB, we conduct a large-scale empirical evaluation of over ten distinct representation-based methods, spanning sequential, graph-based, and text-based architectures. Our central finding is a persistent trade-off between discrimination (AUROC) and calibration (ECE): text-based encoders achieve the best AUROC (0.672), while structurally-aware models yield the best ECE (0.148), with no single method dominating both. Furthermore, we find that increased architectural complexity does not reliably outperform simpler sequential baselines, suggesting a performance ceiling for methods relying solely on chunk-level hidden states. This work provides the most comprehensive benchmark for this task to date, establishing rigorous baselines and demonstrating the limitations of current representation-based paradigms.
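The abstract's central trade-off is stated in terms of two standard metrics: AUROC (how well confidence scores separate correct from incorrect answers) and ECE (how closely stated confidence matches empirical accuracy). A minimal sketch of both, assuming equal-width binning for ECE and the rank-based formulation of AUROC; the paper's exact binning scheme and tie handling are not specified here, so these details are assumptions:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width binned ECE: bin-weight-averaged |accuracy - mean confidence|."""
    conf = np.asarray(confidences, dtype=float)
    corr = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if lo == 0.0:
            mask |= conf == 0.0  # include the left edge in the first bin
        if mask.any():
            gap = abs(corr[mask].mean() - conf[mask].mean())
            ece += mask.mean() * gap  # weight gap by fraction of samples in bin
    return ece

def auroc(confidences, correct):
    """Probability a correct answer gets higher confidence than an incorrect one (ties count half)."""
    conf = np.asarray(confidences, dtype=float)
    is_pos = np.asarray(correct, dtype=bool)
    pos, neg = conf[is_pos], conf[~is_pos]
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return greater + 0.5 * ties
```

A perfectly discriminative but miscalibrated estimator illustrates why the two metrics can diverge: confidences `[0.9, 0.8, 0.2, 0.1]` with correctness `[1, 1, 0, 0]` yield AUROC = 1.0 yet a nonzero ECE, since each stated confidence is off from its bin's accuracy.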
Problem

Research questions and friction points this paper is trying to address.

confidence estimation
Large Reasoning Models
calibration
high-stakes domains
reasoning traces
Innovation

Methods, ideas, or system contributions that make the work stand out.

confidence estimation
reasoning models
calibration
benchmark
representation-based methods