How Uncertainty Estimation Scales with Sampling in Reasoning Models

📅 2026-03-19

🤖 AI Summary

This work addresses the lack of reliable, scalable uncertainty estimation for reasoning language models under extended chain-of-thought reasoning. The authors study parallel sampling as a fully black-box approach and propose a hybrid estimator that combines verbalized confidence with self-consistency. A systematic evaluation across three reasoning models and 17 tasks shows that the two signals are complementary and scale differently with the number of samples: self-consistency starts with lower discrimination and lags behind verbalized confidence under moderate sampling. With as few as two samples, the hybrid estimator improves AUROC by up to 12 points on average, after which returns diminish. Uncertainty quality is highest on mathematical reasoning tasks and weaker on STEM and humanities tasks.

📝 Abstract

Uncertainty estimation is critical for deploying reasoning language models, yet remains poorly understood under extended chain-of-thought reasoning. We study parallel sampling as a fully black-box approach using verbalized confidence and self-consistency. Across three reasoning models and 17 tasks spanning mathematics, STEM, and humanities, we characterize how these signals scale. Both self-consistency and verbalized confidence scale in reasoning models, but self-consistency exhibits lower initial discrimination and lags behind verbalized confidence under moderate sampling. Most uncertainty gains, however, arise from signal combination: with just two samples, a hybrid estimator improves AUROC by up to $+12$ on average and already outperforms either signal alone even when scaled to much larger budgets, after which returns diminish. These effects are domain-dependent: in mathematics, the native domain of RLVR-style post-training, reasoning models achieve higher uncertainty quality and exhibit both stronger complementarity and faster scaling than in STEM or humanities.
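To make the two signals and their combination concrete, here is a minimal sketch in Python. The abstract does not specify how the hybrid estimator combines its inputs, so the weighted average used here (and the `weight` parameter) is an assumption for illustration, not the authors' formula. All function names are hypothetical. Each sample is assumed to be a pair of a final answer and a verbalized confidence in [0, 1]. AUROC is computed with the standard pairwise definition: the probability that a correctly answered question receives a higher confidence score than an incorrectly answered one.

```python
from collections import Counter
from typing import List, Sequence, Tuple

# One parallel sample: (final answer, verbalized confidence in [0, 1]).
Sample = Tuple[str, float]


def self_consistency(samples: List[Sample]) -> Tuple[str, float]:
    """Return the most frequent answer and the fraction of samples that agree with it."""
    counts = Counter(answer for answer, _ in samples)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(samples)


def verbalized_confidence(samples: List[Sample], answer: str) -> float:
    """Mean stated confidence over the samples that produced `answer`."""
    stated = [conf for ans, conf in samples if ans == answer]
    return sum(stated) / len(stated)


def hybrid_confidence(samples: List[Sample], weight: float = 0.5) -> Tuple[str, float]:
    """Combine agreement and stated confidence.

    ASSUMPTION: a simple weighted average; the source does not give the
    actual combination rule.
    """
    answer, agreement = self_consistency(samples)
    stated = verbalized_confidence(samples, answer)
    return answer, weight * agreement + (1.0 - weight) * stated


def auroc(scores: Sequence[float], correct: Sequence[bool]) -> float:
    """Pairwise AUROC: chance that a correct item outscores an incorrect one (ties count 0.5)."""
    pos = [s for s, c in zip(scores, correct) if c]
    neg = [s for s, c in zip(scores, correct) if not c]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one correct and one incorrect item")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Two samples for one question, matching the smallest budget discussed above.
    two_samples = [("42", 0.9), ("42", 0.7)]
    print(hybrid_confidence(two_samples))
```

With two samples, agreement can only take the values 0.5 or 1.0, so on its own it is a coarse signal. Averaging it with the stated confidence yields a finer-grained score, which is one plausible reading of why combining the signals helps at small budgets.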

Problem

Research questions and friction points this paper is trying to address.

uncertainty estimation

reasoning models

chain-of-thought

sampling

self-consistency

Innovation

Methods, ideas, or system contributions that make the work stand out.

uncertainty estimation

parallel sampling

self-consistency