🤖 AI Summary
Existing approaches often assess large language models’ reasoning effort by generation length, yet this metric correlates unreliably with accuracy and can even degrade performance through “overthinking.” This work introduces “deep-thinking tokens”: positions where the model’s internal predictions undergo significant revisions at deeper layers. Building on this, the authors propose the deep-thinking ratio to quantify reasoning effort and use it to design Think@n, a test-time scaling strategy that rejects unpromising samples early. The method establishes a strong link between reasoning quality and dynamic changes in internal representations, overcoming limitations of traditional length- or confidence-based approaches. Experiments show that the deep-thinking ratio exhibits a significant positive correlation with accuracy on benchmarks such as AIME, HMMT, and GPQA-Diamond, and that Think@n matches or surpasses self-consistency methods while reducing inference cost.
📝 Abstract
Large language models (LLMs) have demonstrated impressive reasoning capabilities by scaling test-time compute via long Chain-of-Thought (CoT). However, recent findings suggest that raw token counts are unreliable proxies for reasoning quality: increased generation length does not consistently correlate with accuracy and may instead signal "overthinking," leading to performance degradation. In this work, we quantify inference-time effort by identifying deep-thinking tokens -- tokens whose internal predictions undergo significant revisions in deeper model layers prior to convergence. Across four challenging mathematical and scientific benchmarks (AIME 24/25, HMMT 25, and GPQA-diamond) and a diverse set of reasoning-focused models (GPT-OSS, DeepSeek-R1, and Qwen3), we show that the deep-thinking ratio (the proportion of deep-thinking tokens in a generated sequence) exhibits a robust and consistently positive correlation with accuracy, substantially outperforming both length-based and confidence-based baselines. Leveraging this insight, we introduce Think@n, a test-time scaling strategy that prioritizes samples with high deep-thinking ratios. We demonstrate that Think@n matches or exceeds standard self-consistency performance while significantly reducing inference costs by enabling the early rejection of unpromising generations based on short prefixes.
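To make the two quantities concrete, here is a minimal sketch of how a deep-thinking ratio and a Think@n-style selection could be computed. It assumes access to per-layer argmax predictions for each generated token (e.g. from a logit-lens readout); the function names, the "deeper half of the layers" cutoff, and the revision criterion (any deep-layer prediction differing from the final one) are illustrative assumptions, not the paper's exact definitions.

```python
import numpy as np

def deep_thinking_ratio(layer_preds: np.ndarray, deep_frac: float = 0.5) -> float:
    """Fraction of tokens that are 'deep-thinking' under a simple proxy.

    layer_preds: (num_layers, num_tokens) array of argmax token ids read
    out at every layer, e.g. via a logit-lens projection (assumed input).
    A token counts as deep-thinking if its intermediate prediction is
    still being revised in the deeper layers, i.e. it converges late.
    """
    num_layers = layer_preds.shape[0]
    deep = layer_preds[int(num_layers * deep_frac):]  # deeper layers only
    # Revised = any deep-layer prediction differs from the final layer's.
    revised = (deep != deep[-1]).any(axis=0)
    return float(revised.mean())

def think_at_n(samples):
    """Think@n-style selection: among n (answer, layer_preds) samples,
    keep the answer whose generation has the highest deep-thinking ratio.
    (The paper additionally rejects samples early from short prefixes;
    the same scoring would simply be applied to a truncated prefix.)"""
    return max(samples, key=lambda s: deep_thinking_ratio(s[1]))[0]
```

For early rejection, the same `deep_thinking_ratio` would be evaluated on only the first few hundred tokens of each sample, discarding low-ratio generations before they are fully decoded.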