TRACES: Tagging Reasoning Steps for Adaptive Cost-Efficient Early-Stopping

📅 2026-04-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the inefficiency and limited interpretability of large language models (LLMs) during reasoning, which often generate redundant steps that incur high computational costs. To mitigate this, the authors propose TRACES, a novel framework that introduces a lightweight classifier to dynamically label each reasoning step in real time according to its functional role—such as derivation, verification, or redundancy. Leveraging these annotations, TRACES implements an adaptive early-stopping mechanism that halts generation when further steps are deemed superfluous. Evaluated across multiple benchmarks—including MATH500, GSM8K, AIME, MMLU, and GPQA—the method reduces token generation by 20%–50% while preserving accuracy comparable to standard reasoning approaches, thereby substantially enhancing both computational efficiency and the interpretability of the reasoning process.

📝 Abstract

The field of Language Reasoning Models (LRMs) has been very active over the past few years, with advances in training and inference techniques enabling LRMs to reason longer and more accurately. However, a growing body of studies shows that LRMs are still inefficient, over-generating verification and reflection steps. Additionally, the high-level role of each reasoning step, and how different step types contribute to the generation of correct answers, is largely underexplored. To address these challenges, we introduce TRACES (Tagging of the Reasoning steps enabling Adaptive Cost-Efficient early-Stopping), a lightweight framework that tags reasoning steps in real time and enables adaptive, cost-efficient early stopping of large-language-model inference. Building on this framework, we monitor reasoning behaviors during inference and find that LRMs tend to shift their reasoning behavior after reaching a correct answer. We demonstrate that monitoring specific types of steps can produce effective, interpretable early-stopping criteria. We evaluate the TRACES framework on three mathematical reasoning benchmarks (MATH500, GSM8K, and AIME) and two knowledge and reasoning benchmarks (MMLU and GPQA). We achieve a 20 to 50% token reduction while maintaining accuracy comparable to standard generation.
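To make the idea of tag-driven early stopping concrete, here is a minimal sketch in Python. It is not the authors' implementation. The keyword-based `tag_step` function is a hypothetical stand-in for the framework's step tagger, the three tag names are illustrative assumptions, and the stopping rule (halt after `patience` consecutive verification steps once an answer step has appeared) is only one plausible criterion consistent with the observation that models shift behavior after reaching an answer.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical cue lists; the paper's actual step taxonomy and tagger may differ.
VERIFY_CUES = ("wait", "let me check", "double-check", "verify")
ANSWER_CUES = ("the answer is", "final answer")


def tag_step(step: str) -> str:
    """Toy keyword tagger standing in for a learned step classifier (assumption)."""
    text = step.lower()
    if any(cue in text for cue in VERIFY_CUES):
        return "verification"
    if any(cue in text for cue in ANSWER_CUES):
        return "answer"
    return "derivation"


def generate_with_early_stop(
    steps: Iterable[str],
    patience: int = 2,
    tagger: Callable[[str], str] = tag_step,
) -> Tuple[List[str], List[str]]:
    """Consume reasoning steps one at a time and stop once `patience`
    consecutive verification steps follow an answer step."""
    kept: List[str] = []
    tags: List[str] = []
    answer_seen = False
    streak = 0  # consecutive verification steps seen after an answer
    for step in steps:
        tag = tagger(step)
        kept.append(step)
        tags.append(tag)
        if tag == "answer":
            answer_seen = True
            streak = 0
        elif tag == "verification" and answer_seen:
            streak += 1
            if streak >= patience:
                break  # further steps are assumed to be redundant re-checks
        else:
            streak = 0
    return kept, tags


if __name__ == "__main__":
    trace = [
        "Compute 12 * 4 = 48.",
        "So the answer is 48.",
        "Wait, let me check the multiplication.",
        "Let me verify again: 12 * 4 = 48.",
        "Wait, maybe I should double-check once more.",
        "Alternatively, add 12 four times.",
    ]
    kept, tags = generate_with_early_stop(iter(trace), patience=2)
    print(len(kept), tags)
```

Because `steps` is consumed lazily, passing a generator that yields steps from a streaming decoder would mean the remaining steps are never generated once the loop breaks, which is where the token savings would come from in this sketch.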

Problem

Research questions and friction points this paper is trying to address.

Language Reasoning Models

reasoning efficiency

over-generation

reasoning steps

early stopping

Innovation

Methods, ideas, or system contributions that make the work stand out.

early stopping

reasoning step tagging

cost-efficient inference