Don't Miss the Forest for the Trees: In-Depth Confidence Estimation for LLMs via Reasoning over the Answer Space

📅 2025-11-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) exhibit unreliable confidence estimation, undermining trustworthiness and decision-making. Method: We propose a deep reasoning framework that replaces point-answer prediction with verbalized probability distribution forecasting, jointly modeling multiple candidate answers and explicitly generating their calibrated probabilities, guided by chain-of-thought reasoning that systematically explores the full answer space. Contribution/Results: The approach tightly couples confidence estimation with reasoning over the answer distribution, yielding outputs that are both interpretable and aligned with human judgment, and it applies to both known and unknown answer spaces. It consistently outperforms state-of-the-art confidence calibration methods across diverse model architectures (e.g., LLaMA, Qwen) and tasks (e.g., fact verification, commonsense QA), remains robust after reinforcement learning fine-tuning, and produces reasoning trajectories that more closely mirror human cognitive patterns.

📝 Abstract
Knowing the reliability of a model's response is essential in applications. With the strong generation capabilities of LLMs, research has focused on generating verbalized confidence. This is further enhanced by combining chain-of-thought reasoning, which provides logical and transparent estimation. However, how reasoning strategies affect the estimated confidence is still under-explored. In this work, we demonstrate that predicting a verbalized probability distribution can effectively encourage in-depth reasoning for confidence estimation. Intuitively, it requires an LLM to consider all candidates within the answer space instead of relying on a single guess, and to carefully assign confidence scores that satisfy the requirements of a distribution. This method shows an advantage across different models and various tasks, regardless of whether the answer space is known. Its advantage is maintained even after reinforcement learning, and further analysis shows its reasoning patterns are aligned with human expectations.
Problem

Research questions and friction points this paper is trying to address.

Exploring how reasoning strategies impact confidence estimation in LLMs
Developing verbalized probability distributions to encourage deeper reasoning
Ensuring reliable confidence scores across models and diverse tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Predicts verbalized probability distribution for confidence estimation
Encourages reasoning over entire answer space candidates
Assigns confidence scores to meet distribution requirements
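The bullets above hinge on the model emitting a well-formed distribution over candidate answers. A minimal sketch of the consumer side, assuming the model verbalizes one "candidate: probability" pair per line (the example reply, parser, and renormalization step are illustrative assumptions, not the paper's implementation):

```python
import re

def parse_distribution(text):
    """Parse lines like 'Paris: 0.85' into a {candidate: probability} dict."""
    dist = {}
    for line in text.strip().splitlines():
        m = re.match(r"\s*(.+?)\s*:\s*(\d*\.?\d+)\s*$", line)
        if m:
            dist[m.group(1)] = float(m.group(2))
    return dist

def renormalize(dist):
    """Rescale parsed scores so they form a valid probability distribution."""
    total = sum(dist.values())
    if total <= 0:
        raise ValueError("no probability mass parsed from model output")
    return {k: v / total for k, v in dist.items()}

# Hypothetical verbalized distribution from the model's final answer block.
reply = """Paris: 0.85
Lyon: 0.10
Marseille: 0.05"""

dist = renormalize(parse_distribution(reply))
confidence = max(dist.values())  # confidence of the top candidate
```

The distribution constraint is what forces the model past a single guess: mass assigned to one candidate must be justified against the others, which is the in-depth reasoning the method aims to elicit.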
Ante Wang
Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China
Weizhi Ma
Tsinghua University
LLM and Agents · Recommendation · AI for Healthcare
Yang Liu
Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University, Beijing, China