Thought calibration: Efficient and confident test-time scaling

📅 2025-05-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the high inference cost of reasoning large language models (LLMs) caused by prolonged test-time thinking. The authors propose thought calibration, a mechanism that decides dynamically when thinking can be terminated. Methodologically, they (1) frame a model's growing body of thoughts as a nested sequence of reasoning trees, where the goal is to identify the point at which novel reasoning plateaus; and (2) train lightweight probes on top of the model's hidden representations that are informative of both the reasoning structure and the consistency of the final response. Evaluated on three reasoning LLMs and four benchmark datasets, the method preserves model performance while reducing thinking tokens by up to 60% on in-distribution data and up to 20% on out-of-distribution data.

📝 Abstract
Reasoning large language models achieve impressive test-time scaling by thinking for longer, but this performance gain comes at significant compute cost. Directly limiting test-time budget hurts overall performance, but not all problems are equally difficult. We propose thought calibration to decide dynamically when thinking can be terminated. To calibrate our decision rule, we view a language model's growing body of thoughts as a nested sequence of reasoning trees, where the goal is to identify the point at which novel reasoning plateaus. We realize this framework through lightweight probes that operate on top of the language model's hidden representations, which are informative of both the reasoning structure and overall consistency of response. Based on three reasoning language models and four datasets, thought calibration preserves model performance with up to a 60% reduction in thinking tokens on in-distribution data, and up to 20% in out-of-distribution data.
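The probe described in the abstract can be sketched as a small linear classifier over the model's hidden state at each candidate stopping point. This is a minimal illustration, not the paper's actual architecture: the class name, shapes, and training loop here are all hypothetical, and the toy data stands in for real hidden representations.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class StopProbe:
    """Hypothetical lightweight probe: maps a hidden state to P(safe to stop)."""

    def __init__(self, hidden_dim):
        self.w = rng.normal(scale=0.01, size=hidden_dim)
        self.b = 0.0

    def predict(self, h):
        return sigmoid(h @ self.w + self.b)

    def fit(self, H, y, lr=0.1, epochs=200):
        # Plain logistic regression trained by gradient descent.
        for _ in range(epochs):
            p = sigmoid(H @ self.w + self.b)
            grad = p - y
            self.w -= lr * (H.T @ grad) / len(y)
            self.b -= lr * grad.mean()

# Toy data: pretend the first feature of the hidden state encodes
# "novel reasoning has plateaued", which labels when stopping is safe.
H = rng.normal(size=(200, 16))
y = (H[:, 0] > 0).astype(float)

probe = StopProbe(hidden_dim=16)
probe.fit(H, y)

# Decision rule: terminate thinking once P(safe to stop) crosses a threshold.
h_new = rng.normal(size=16)
should_stop = bool(probe.predict(h_new) > 0.8)
```

Because the probe is a single linear map over representations the model already computes, the per-step overhead of the stopping decision is negligible compared to generating more thinking tokens.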
Problem

Research questions and friction points this paper is trying to address.

Prolonged test-time reasoning incurs significant compute cost
Hard limits on the thinking budget hurt overall performance
Problems vary in difficulty, so a fixed stopping point is suboptimal
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic termination of thinking via thought calibration
Lightweight probes analyze hidden representations
Nested reasoning trees identify reasoning plateaus
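The probe's stopping threshold has to be calibrated so that terminating early rarely changes the final answer. A simple hedged sketch of such a calibration step is below: it picks, on held-out data, the smallest threshold at which early stopping is wrong at most an `alpha` fraction of the time. The function name and the procedure are illustrative assumptions, not the paper's exact calibration method.

```python
import numpy as np

def calibrate_threshold(probe_scores, is_correct_if_stopped, alpha=0.05):
    """Return the smallest threshold t such that, among held-out examples
    with probe score >= t, stopping early is wrong at most alpha of the time.
    Falls back to 1.0 (never stop early) if no threshold is safe."""
    for t in np.unique(probe_scores):  # candidates in ascending order
        mask = probe_scores >= t
        if mask.sum() == 0:
            break
        error = 1.0 - is_correct_if_stopped[mask].mean()
        if error <= alpha:
            return t
    return 1.0

# Toy held-out set: higher probe scores tend to mean stopping is safe.
rng = np.random.default_rng(1)
scores = rng.uniform(size=500)
correct = rng.uniform(size=500) < scores  # safety probability rises with score

t = calibrate_threshold(scores, correct, alpha=0.1)
```

Choosing the *smallest* safe threshold maximizes how often the model stops early (and hence the token savings) subject to the error tolerance, which mirrors the paper's framing of preserving performance while cutting thinking tokens.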