How Reliable are Confidence Estimators for Large Reasoning Models? A Systematic Benchmark on High-Stakes Domains

📅 2026-01-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the unreliable confidence estimates of large reasoning models in high-stakes domains such as clinical, financial, legal, and mathematical applications. To this end, the authors introduce RMCB, the first comprehensive public benchmark for confidence estimation in large reasoning models, comprising 347,496 reasoning trajectories. They systematically evaluate over ten representation-based methods built on hidden states—including sequence-based, graph-structured, and text encoders—and uncover a fundamental trade-off between discriminative ability and calibration performance. Notably, increased architectural complexity does not consistently yield gains: the best AUROC (0.672) is achieved by a text encoder, while the lowest expected calibration error (ECE = 0.148) comes from a structure-aware model. No single method dominates both metrics, revealing a performance bottleneck in current paradigms.

📝 Abstract
The miscalibration of Large Reasoning Models (LRMs) undermines their reliability in high-stakes domains, necessitating methods to accurately estimate the confidence of their long-form, multi-step outputs. To address this gap, we introduce the Reasoning Model Confidence estimation Benchmark (RMCB), a public resource of 347,496 reasoning traces from six popular LRMs across different architectural families. The benchmark is constructed from a diverse suite of datasets spanning high-stakes domains, including clinical, financial, legal, and mathematical reasoning, alongside complex general reasoning benchmarks, with correctness annotations provided for all samples. Using RMCB, we conduct a large-scale empirical evaluation of over ten distinct representation-based methods, spanning sequential, graph-based, and text-based architectures. Our central finding is a persistent trade-off between discrimination (AUROC) and calibration (ECE): text-based encoders achieve the best AUROC (0.672), while structurally-aware models yield the best ECE (0.148), with no single method dominating both. Furthermore, we find that increased architectural complexity does not reliably outperform simpler sequential baselines, suggesting a performance ceiling for methods relying solely on chunk-level hidden states. This work provides the most comprehensive benchmark for this task to date, establishing rigorous baselines and demonstrating the limitations of current representation-based paradigms.
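The abstract's central trade-off is stated in terms of two standard metrics: AUROC (how well confidence scores separate correct from incorrect answers) and ECE (how closely stated confidence matches empirical accuracy). A minimal sketch of both, assuming equal-width binning for ECE and the rank-based formulation of AUROC; the paper's exact binning scheme and tie handling are not specified here, so these details are assumptions:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width binned ECE: bin-weight-averaged |accuracy - mean confidence|."""
    conf = np.asarray(confidences, dtype=float)
    corr = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if lo == 0.0:
            mask |= conf == 0.0  # include the left edge in the first bin
        if mask.any():
            gap = abs(corr[mask].mean() - conf[mask].mean())
            ece += mask.mean() * gap  # weight gap by fraction of samples in bin
    return ece

def auroc(confidences, correct):
    """Probability a correct answer gets higher confidence than an incorrect one (ties count half)."""
    conf = np.asarray(confidences, dtype=float)
    is_pos = np.asarray(correct, dtype=bool)
    pos, neg = conf[is_pos], conf[~is_pos]
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return greater + 0.5 * ties
```

A perfectly discriminative but miscalibrated estimator illustrates why the two metrics can diverge: confidences `[0.9, 0.8, 0.2, 0.1]` with correctness `[1, 1, 0, 0]` yield AUROC = 1.0 yet a nonzero ECE, since each stated confidence is off from its bin's accuracy.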
Problem

Research questions and friction points this paper is trying to address.

confidence estimation
Large Reasoning Models
calibration
high-stakes domains
reasoning traces
Innovation

Methods, ideas, or system contributions that make the work stand out.

confidence estimation
reasoning models
calibration
benchmark
representation-based methods