Do Reasoning Models Show Better Verbalized Calibration?

📅 2025-04-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates whether large reasoning models (LRMs) exhibit better verbalized confidence calibration than instruction-tuned models on complex reasoning and factuality tasks. We systematically evaluate two LRM families, supervised fine-tuning (SFT)-distilled models and outcome-based reinforcement learning (RL)-trained models, using calibration metrics including expected calibration error (ECE) and confidence-accuracy alignment. The key finding is that LRM calibration is highly task-dependent: on complex reasoning tasks, LRMs outperform instruction-tuned models in both accuracy and calibration; on factuality tasks, however, a counterintuitive pattern emerges: the smaller QwQ-32B shows no calibration improvement over instruct models, SFT-distilled variants are more overconfident, and RL training markedly improves self-awareness and the reliability of confidence estimates. The study uncovers critical interactions among reasoning paradigm, model scale, and task type in shaping calibration behavior, providing an empirical foundation for designing trustworthy reasoning models.
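The summary's headline metric, expected calibration error (ECE), bins predictions by stated confidence and averages the gap between each bin's accuracy and mean confidence, weighted by bin size. A minimal sketch, assuming equal-width bins over verbalized confidences in [0, 1]; the bin count and variable names are illustrative, not taken from the paper:

```python
# Sketch of expected calibration error (ECE) for verbalized confidences.
# ECE = sum over bins of (|bin| / N) * |accuracy(bin) - mean_confidence(bin)|.
# Equal-width binning and n_bins=10 are common defaults, assumed here.
def expected_calibration_error(confidences, correct, n_bins=10):
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Half-open bins [lo, hi); the last bin also includes confidence 1.0.
        idx = [i for i, c in enumerate(confidences)
               if lo <= c < hi or (b == n_bins - 1 and c == hi)]
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)   # empirical accuracy in bin
        conf = sum(confidences[i] for i in idx) / len(idx)  # mean stated confidence
        ece += (len(idx) / n) * abs(acc - conf)
    return ece

# A maximally overconfident model: 100% stated confidence, 50% accuracy -> ECE 0.5.
print(expected_calibration_error([1.0, 1.0, 1.0, 1.0], [1, 1, 0, 0]))
```

Lower ECE means the model's stated confidence tracks its actual accuracy more closely, which is the sense in which the paper compares LRMs and instruction-tuned models.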

📝 Abstract
Large reasoning models (LRMs) have recently shown impressive capabilities in complex reasoning by leveraging increased test-time computation and exhibiting behaviors akin to human-like deliberation. Despite these advances, it remains an open question whether LRMs are better calibrated, particularly in their verbalized confidence, compared to instruction-tuned counterparts. In this paper, we investigate the calibration properties of LRMs trained via supervised fine-tuning distillation on long reasoning traces (henceforth SFT reasoning models) and outcome-based reinforcement learning for reasoning (henceforth RL reasoning models) across diverse domains. Our findings reveal that LRMs significantly outperform instruction-tuned models on complex reasoning tasks in both accuracy and confidence calibration. In contrast, we find surprising trends in the domain of factuality in particular. On factuality tasks, while Deepseek-R1 shows strong calibration behavior, the smaller QwQ-32B shows no improvement over instruct models; moreover, SFT reasoning models display worse calibration (greater overconfidence) compared to instruct models. Our results provide evidence for a potentially critical role of reasoning-oriented RL training in improving LLMs' capacity for generating trustworthy, self-aware outputs.
Problem

Research questions and friction points this paper is trying to address.

Assess calibration of large reasoning models' verbalized confidence
Compare reasoning models vs instruction-tuned models on accuracy
Investigate RL training's role in improving trustworthy outputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Supervised fine-tuning distillation on reasoning traces
Outcome-based reinforcement learning for reasoning
Improved calibration via reasoning-oriented RL training