Unsupervised Confidence Calibration for Reasoning LLMs from a Single Generation

📅 2026-04-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the poor calibration of confidence estimates that large language models produce when only a single generation is available. Existing approaches typically require labeled data or multiple generations at inference time, which limits their practical use. The authors propose an unsupervised calibration method that builds a self-consistency-based proxy target offline from unlabeled data and distills it into a lightweight confidence predictor. At deployment, the predictor needs no labels and only a single model generation. Evaluated on five mathematical and question-answering benchmarks with nine reasoning models, the method substantially outperforms baselines, including under distribution shift, and improves selective prediction and simulated downstream decision-making.
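
The two stages described above (an offline self-consistency proxy target, then distillation into a small predictor) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the source does not specify the exact form of the proxy target, the features, or the predictor. Here the target is assumed to be the share of offline samples that agree with the deployed answer, the single feature is a hypothetical mean token log-probability, and the predictor is a plain logistic regression trained on soft targets. All names (`proxy_target`, `fit_predictor`, `predict`) and all data are made up for the example.

```python
import math
from collections import Counter


def proxy_target(answer, offline_answers):
    """Share of offline samples whose final answer matches `answer` (assumed proxy)."""
    if not offline_answers:
        raise ValueError("need at least one offline sample")
    return Counter(offline_answers)[answer] / len(offline_answers)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, x):
    """Confidence in (0, 1) from the features of a single generation."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def fit_predictor(features, targets, lr=0.5, epochs=2000):
    """Logistic regression fit to soft targets with batch gradient descent."""
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, t in zip(features, targets):
            err = predict(w, b, x) - t  # gradient of cross-entropy w.r.t. the logit
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Hypothetical unlabeled data: (features of the deployed generation,
# its final answer, final answers from offline resampling). No gold labels.
data = [
    ([-0.1], "42", ["42", "42", "42", "42"]),
    ([-0.3], "7", ["7", "7", "7", "9"]),
    ([-1.2], "x=3", ["x=3", "x=2", "x=3", "x=5"]),
    ([-2.0], "12", ["12", "15", "11", "10"]),
]

targets = [proxy_target(answer, samples) for _, answer, samples in data]
w, b = fit_predictor([x for x, _, _ in data], targets)

# At deployment only the single generation's features are needed.
confidence = predict(w, b, [-0.5])
```

The repeated sampling happens only while building `targets`; `predict` runs on one generation, which is the property the summary emphasizes.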

📝 Abstract

Reasoning language models can solve increasingly complex tasks, but struggle to produce the calibrated confidence estimates necessary for reliable deployment. Existing calibration methods usually depend on labels or repeated sampling at inference time, making them impractical in many settings. We introduce a method for unsupervised confidence calibration of reasoning LLMs when only a single generation is available at inference time. Our approach uses offline sampling on unlabeled data to derive a self-consistency-based proxy target, then distills this signal into a lightweight deployment-time confidence predictor. In a broad evaluation across 5 math and question-answering tasks using 9 reasoning models, our method substantially outperforms baselines, including under distribution shift, and improves downstream performance in selective prediction and simulated downstream decision-making.
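
To make the evaluation terms concrete, the sketch below shows one common way to score calibration and selective prediction. It is an illustration under assumptions: the abstract does not name its metrics, so expected calibration error with equal-width bins and a simple confidence threshold for abstention are stand-ins chosen here, and the confidences and correctness flags are made-up numbers.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Size-weighted mean of |accuracy - mean confidence| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, y))
    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(y for _, y in items) / len(items)
        ece += len(items) / total * abs(accuracy - mean_conf)
    return ece


def selective_accuracy(confidences, correct, threshold):
    """Return (coverage, accuracy on answered items) when abstaining below threshold."""
    kept = [y for c, y in zip(confidences, correct) if c >= threshold]
    coverage = len(kept) / len(confidences)
    accuracy = sum(kept) / len(kept) if kept else float("nan")
    return coverage, accuracy


# Made-up predicted confidences and correctness flags (1 = correct answer).
conf = [0.9, 0.8, 0.6, 0.3]
hits = [1, 1, 0, 0]

ece = expected_calibration_error(conf, hits)
coverage, accuracy = selective_accuracy(conf, hits, threshold=0.7)
```

With a well-calibrated predictor, raising the threshold trades coverage for accuracy in a predictable way, which is what makes the confidence useful for deciding when to abstain.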

Problem

Research questions and friction points this paper is trying to address.

confidence calibration

reasoning LLMs

unsupervised learning

single generation

reliable deployment

Innovation

Methods, ideas, or system contributions that make the work stand out.

unsupervised calibration

single-generation inference

self-consistency