Trace Length is a Simple Uncertainty Signal in Reasoning Models

📅 2025-10-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Reliable deployment of large language models (LLMs) is hindered by the difficulty of quantifying inference-time uncertainty and mitigating hallucination risk. Method: This paper identifies chain-of-thought (CoT) trace length as a simple, stable, zero-shot proxy for confidence, evaluated across diverse models (e.g., LLaMA, Qwen), tasks (mathematical and commonsense reasoning), and prompting schemes. We analyze how reasoning post-training alters the CoT-length–accuracy relationship, identifying high-entropy "forking" tokens as a key part of the mechanism. Controlled analyses show the effect persists after adjusting for confounders such as problem difficulty and GRPO-induced length bias. Results: CoT length achieves uncertainty-estimation performance comparable to verbalized confidence, and combining the two yields further calibration gains. This work provides both a new perspective on trustworthy reasoning and a practical, parameter-free tool for uncertainty quantification in LLM inference.

📝 Abstract
Uncertainty quantification for LLMs is a key research direction towards addressing hallucination and other issues that limit their reliable deployment. In this work, we show that reasoning trace length is a simple and useful confidence estimator in large reasoning models. Through comprehensive experiments across multiple models, datasets, and prompts, we show that trace length performs in comparable but complementary ways to other zero-shot confidence estimators such as verbalized confidence. Our work reveals that reasoning post-training fundamentally alters the relationship between trace length and accuracy, going beyond prior work that had shown that post-training causes traces to grow longer in general (e.g., "overthinking"). We investigate the mechanisms behind trace length's performance as a confidence signal, observing that the effect remains even after adjusting for confounders such as problem difficulty and GRPO-induced length bias. We identify high-entropy or "forking" tokens as playing a key role in the mechanism. Our findings demonstrate that reasoning post-training enhances uncertainty quantification beyond verbal expressions, and establish trace length as a practical confidence measure for large reasoning models.
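The abstract's core claim is that trace length acts as a zero-shot confidence estimator. As a minimal illustrative sketch (not the paper's code), one could score each answer by the negative of its trace length and measure how well that score separates correct from incorrect answers via AUROC; the data below is a made-up toy example, and the helper function is an assumption of this sketch.

```python
# Sketch: reasoning-trace length as a zero-shot confidence signal.
# Toy data; in practice `trace_len` would be the token count of the
# model's chain-of-thought and `correct` whether its final answer was right.

def auroc(scores, labels):
    """AUROC via pairwise comparison: P(score for a correct answer > score for an incorrect one)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return float("nan")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Shorter traces tend to accompany correct answers on this toy data,
# so negative trace length serves as the confidence score.
trace_len = [120, 340, 95, 610, 150, 480]
correct   = [1,   0,   1,  0,   1,   0]
confidence = [-l for l in trace_len]
print(auroc(confidence, correct))  # 1.0 on this perfectly separable toy data
```

An AUROC near 0.5 would mean trace length carries no signal; the paper's claim is that on real reasoning benchmarks it lands well above chance, comparable to verbalized confidence.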
Problem

Research questions and friction points this paper is trying to address.

Trace length serves as an uncertainty signal for reasoning models
It provides zero-shot confidence estimation comparable to verbalized methods
The mechanism involves high-entropy tokens and persists after adjusting for confounders
Innovation

Methods, ideas, or system contributions that make the work stand out.

Trace length serves as a confidence estimator
Post-training alters the trace length–accuracy relationship
High-entropy tokens drive the trace-length mechanism
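The summary notes that combining trace length with verbalized confidence improves calibration. One simple way such a combination could work, sketched here with invented numbers (the helper and averaging scheme are assumptions, not the paper's method), is to rank each signal within a batch and average the ranks:

```python
# Sketch: combining two zero-shot confidence signals by averaging
# within-batch ranks. All numbers are illustrative, not the paper's data.

def rank01(xs):
    """Map values to ranks scaled to [0, 1] (ties broken by index)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = rank / (len(xs) - 1) if len(xs) > 1 else 0.5
    return r

neg_trace_len = [-120, -340, -95, -610]   # shorter trace -> higher score
verbalized    = [0.9, 0.6, 0.7, 0.3]      # model's stated confidence

combined = [(a + b) / 2
            for a, b in zip(rank01(neg_trace_len), rank01(verbalized))]
print(combined)
```

Rank-averaging sidesteps the fact that the two signals live on different scales (token counts vs. stated probabilities); any monotone normalization would serve the same purpose in this sketch.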