🤖 AI Summary
Language reasoning models (LRMs) suffer from low generation efficiency due to redundant verification and reflection steps. This paper proposes Step-Tagging, a lightweight framework that enables controllable intervention during inference by real-time identification and tagging of reasoning step types. Its core contributions are threefold: (1) the first principled ReasonType taxonomy for classifying reasoning steps; (2) an interpretable, step-count–based online early-stopping mechanism supporting dynamic termination; and (3) a lightweight sentence classifier integrated with a custom monitoring strategy. Evaluated on MATH500, GSM8K, AIME, GPQA, and MMLU-Pro, Step-Tagging achieves 20%–50% token reduction with zero accuracy degradation—yielding the greatest computational savings on highly demanding reasoning tasks.
📝 Abstract
The field of Language Reasoning Models (LRMs) has been very active over the past few years, with advances in training and inference techniques enabling LRMs to reason longer and more accurately. However, a growing body of studies shows that LRMs are still inefficient, over-generating verification and reflection steps. To address this challenge, we introduce the Step-Tagging framework, a lightweight sentence classifier that enables real-time annotation of the type of reasoning step an LRM is generating. To monitor reasoning behaviors, we introduce ReasonType: a novel taxonomy of reasoning steps. Building on this framework, we demonstrate that online monitoring of the counts of specific step types yields effective, interpretable early-stopping criteria for LRM inference. We evaluate the Step-Tagging framework on three open-source reasoning models across standard benchmark datasets: MATH500, GSM8K, AIME, and non-mathematical tasks (GPQA and MMLU-Pro). We achieve 20–50% token reduction while maintaining accuracy comparable to standard generation, with the largest gains observed on more computation-heavy tasks. This work offers a novel way to increase control over the generation of LRMs, and a new tool for studying LRM behaviors.
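The early-stopping mechanism described above, counting tagged step types online and terminating generation when redundant steps accumulate, can be sketched as follows. This is an illustrative reconstruction, not the paper's actual implementation: the tag names, the choice of which ReasonType categories count as "redundant", and the threshold value are all assumptions for the sake of the example.

```python
from collections import Counter

# Assumed subset of the ReasonType taxonomy treated as redundant;
# the paper's exact categories and threshold may differ.
REDUNDANT_TYPES = {"verification", "reflection"}


def should_stop(step_tags, max_redundant=3):
    """Return True once the running count of redundant step types
    (e.g. verification/reflection) reaches the threshold."""
    counts = Counter(step_tags)
    redundant = sum(counts[t] for t in REDUNDANT_TYPES)
    return redundant >= max_redundant


# Simulated stream of per-step tags, as a Step-Tagging-style
# classifier might emit them during decoding.
stream = ["deduction", "calculation", "verification", "reflection",
          "verification", "deduction"]

seen = []
stopped_at = None
for i, tag in enumerate(stream):
    seen.append(tag)
    if should_stop(seen):
        stopped_at = i  # terminate generation here
        break
```

In this toy run the monitor halts at the third redundant step, cutting off the remaining tokens; in practice the threshold would be tuned per benchmark to preserve accuracy while reducing generation length.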