🤖 AI Summary
Existing approaches often assess large language models’ reasoning effort by generation length, yet this metric correlates unreliably with accuracy and can even degrade performance through “overthinking.” This work introduces “deep-thinking tokens”: positions where the model’s internal predictions undergo significant revisions at deeper layers. Building on this, the authors propose the deep-thinking ratio to quantify reasoning effort and use it to design Think@n, a test-time scaling strategy that rejects unpromising samples early. The method establishes a strong link between reasoning quality and dynamic changes in internal representations, overcoming limitations of traditional length- or confidence-based approaches. Experiments show that the deep-thinking ratio exhibits a significant positive correlation with accuracy on benchmarks such as AIME, HMMT, and GPQA-Diamond, and that Think@n matches or surpasses self-consistency methods while reducing inference cost.
📝 Abstract
Large language models (LLMs) have demonstrated impressive reasoning capabilities by scaling test-time compute via long Chain-of-Thought (CoT). However, recent findings suggest that raw token counts are unreliable proxies for reasoning quality: increased generation length does not consistently correlate with accuracy and may instead signal "overthinking," leading to performance degradation. In this work, we quantify inference-time effort by identifying deep-thinking tokens -- tokens whose internal predictions undergo significant revisions in deeper model layers prior to convergence. Across four challenging mathematical and scientific benchmarks (AIME 24/25, HMMT 25, and GPQA-diamond) and a diverse set of reasoning-focused models (GPT-OSS, DeepSeek-R1, and Qwen3), we show that the deep-thinking ratio (the proportion of deep-thinking tokens in a generated sequence) exhibits a robust and consistently positive correlation with accuracy, substantially outperforming both length-based and confidence-based baselines. Leveraging this insight, we introduce Think@n, a test-time scaling strategy that prioritizes samples with high deep-thinking ratios. We demonstrate that Think@n matches or exceeds standard self-consistency performance while significantly reducing inference costs by enabling the early rejection of unpromising generations based on short prefixes.
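To make the two quantities concrete, here is a minimal sketch of how a deep-thinking ratio and a Think@n-style selection could be computed. It assumes access to per-layer argmax predictions for each generated token (e.g. from a logit-lens readout); the function names, the "deeper half of the layers" cutoff, and the revision criterion (any deep-layer prediction differing from the final one) are illustrative assumptions, not the paper's exact definitions.

```python
import numpy as np

def deep_thinking_ratio(layer_preds: np.ndarray, deep_frac: float = 0.5) -> float:
    """Fraction of tokens that are 'deep-thinking' under a simple proxy.

    layer_preds: (num_layers, num_tokens) array of argmax token ids read
    out at every layer, e.g. via a logit-lens projection (assumed input).
    A token counts as deep-thinking if its intermediate prediction is
    still being revised in the deeper layers, i.e. it converges late.
    """
    num_layers = layer_preds.shape[0]
    deep = layer_preds[int(num_layers * deep_frac):]  # deeper layers only
    # Revised = any deep-layer prediction differs from the final layer's.
    revised = (deep != deep[-1]).any(axis=0)
    return float(revised.mean())

def think_at_n(samples):
    """Think@n-style selection: among n (answer, layer_preds) samples,
    keep the answer whose generation has the highest deep-thinking ratio.
    (The paper additionally rejects samples early from short prefixes;
    the same scoring would simply be applied to a truncated prefix.)"""
    return max(samples, key=lambda s: deep_thinking_ratio(s[1]))[0]
```

For early rejection, the same `deep_thinking_ratio` would be evaluated on only the first few hundred tokens of each sample, discarding low-ratio generations before they are fully decoded.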