🤖 AI Summary
To address the efficiency degradation and accuracy loss that excessive reasoning causes in large reasoning language models during long chain-of-thought (CoT) generation, this paper proposes a training-free dynamic early-exit mechanism. The method leverages the model's intrinsic token-level confidence scores to assess and autonomously terminate redundant reasoning steps in real time, particularly at reasoning transition points (e.g., "Wait" tokens), thereby overcoming the limitations of fixed-length truncation. Its core components are token-behavior monitoring, adaptive confidence modeling, and a dynamic termination policy, all natively compatible with o1-style reasoning architectures. Evaluated on four major benchmarks (including MATH-500), the approach achieves 31–43% average CoT compression while improving accuracy by 1.7% to 5.7%, demonstrating that high accuracy and high efficiency in CoT-based reasoning can be achieved concurrently.
📝 Abstract
Recent advances in large reasoning language models (LRLMs) rely on test-time scaling, which extends long chain-of-thought (CoT) generation to solve complex tasks. However, overthinking in long CoT not only slows down problem solving, but also risks accuracy loss due to extremely detailed or redundant reasoning steps. We propose a simple yet effective method that allows LLMs to self-truncate CoT sequences by exiting early during generation. Instead of relying on fixed heuristics, the proposed method monitors model behavior at potential reasoning transition points (e.g., "Wait" tokens) and dynamically terminates the next reasoning chain's generation when the model exhibits high confidence in a trial answer. Our method requires no additional training and can be seamlessly integrated into existing o1-like reasoning LLMs. Experiments on multiple reasoning benchmarks (MATH-500, AMC 2023, GPQA Diamond, and AIME 2024) show that the proposed method is consistently effective on DeepSeek-series reasoning LLMs, reducing the length of CoT sequences by an average of 31% to 43% while improving accuracy by 1.7% to 5.7%.
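The abstract's mechanism, probing a trial answer at reasoning transition points and stopping once its token-level confidence is high, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the transition-marker set, the threshold value, and the helper names (`answer_confidence`, `should_exit`, `reason_with_early_exit`) are all assumptions for exposition, and a real system would probe a live model for trial-answer log-probabilities instead of receiving them precomputed.

```python
import math

# Assumed set of tokens that mark a transition to a new reasoning chain.
WAIT_TOKENS = {"Wait", "Alternatively", "Hmm"}
# Assumed confidence cutoff for terminating generation early.
CONF_THRESHOLD = 0.9


def answer_confidence(logprobs):
    """Mean token probability of a trial answer (geometric mean of probs)."""
    return math.exp(sum(logprobs) / len(logprobs))


def should_exit(trial_answer_logprobs, threshold=CONF_THRESHOLD):
    """Exit when the model's trial answer is high-confidence."""
    return answer_confidence(trial_answer_logprobs) >= threshold


def reason_with_early_exit(steps):
    """steps: list of (step_text, trial_answer_logprobs or None).
    A trial answer is probed only at reasoning transition points;
    remaining reasoning chains are truncated once confidence is high."""
    kept = []
    for text, probe in steps:
        kept.append(text)
        at_transition = any(text.startswith(w) for w in WAIT_TOKENS)
        if at_transition and probe is not None and should_exit(probe):
            break  # self-truncate the rest of the chain-of-thought
    return kept
```

With per-token probabilities around 0.95 at a "Wait" transition, the chain is cut there; with probabilities around 0.5, generation continues through all steps.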