SpecExit: Accelerating Large Reasoning Model via Speculative Exit

📅 2025-09-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large reasoning models (LRMs) suffer from excessive computation, termed "overthinking," which produces verbose outputs, inflates end-to-end latency, and hinders practical deployment. To address this, the authors propose a probe-free early-exit mechanism: a lightweight draft model jointly predicts the next token and an exit signal directly from the main model's hidden states, enabling dynamic, real-time termination of inference. Crucially, the approach unifies token prediction and exit decision-making within a single lightweight head, eliminating the computational overhead and latency introduced by conventional probe-based exit methods. Evaluated across multiple reasoning benchmarks, the method reduces average generation length by 66% and improves end-to-end latency by 2.5x without compromising the original model's accuracy. This work establishes a general-purpose early-exit paradigm for efficient LRM inference, characterized by low computational overhead, no accuracy degradation, and seamless integration into standard autoregressive decoding.

📝 Abstract
Despite their strong performance on reasoning tasks, large reasoning models (LRMs) often suffer from overthinking, producing unnecessarily long outputs and incurring high end-to-end latency, a significant limitation to their real-world deployment. To address overthinking, early-exit mechanisms have been proposed to terminate reasoning before typical completion, showing that this approach can effectively shorten generation length with minimal impact on accuracy. However, their reliance on probing mechanisms introduces a detection overhead that limits their end-to-end latency gains and compromises their generalizability across diverse problems. Inspired by the use of hidden states in speculative decoding, we propose SpecExit, a novel framework that predicts both future tokens and an early-exit signal directly from a lightweight draft model without probing overhead. Our method offers significant improvements, reducing average generation length by 66% and achieving a 2.5x speedup in end-to-end latency compared to the speculative decoding baseline, without compromising accuracy. Our method leverages the inherent signals from hidden states to provide effective early-exit signals, suggesting broader use of hidden states for efficient reasoning. Our code is available at https://github.com/Tencent/AngelSlim.
Problem

Research questions and friction points this paper is trying to address.

Reduces overthinking in large reasoning models
Minimizes detection overhead in early-exit mechanisms
Accelerates reasoning without compromising model accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Predicts tokens and exit signals from hidden states
Uses lightweight draft model without probing overhead
Reduces generation length and latency while maintaining accuracy
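The core idea above, a single lightweight head that maps the main model's hidden state to both next-token logits and an exit probability, can be sketched as follows. This is a minimal illustrative mock in numpy, not the paper's implementation: the hidden size, vocabulary size, weight shapes, and exit threshold are all assumptions chosen for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN, VOCAB = 64, 100  # illustrative sizes, not the paper's

# Hypothetical weights of one lightweight head: a token-prediction
# projection plus a single extra column producing the exit logit.
W_tok = rng.normal(scale=0.02, size=(HIDDEN, VOCAB))
w_exit = rng.normal(scale=0.02, size=HIDDEN)


def draft_step(hidden_state, exit_threshold=0.9):
    """Jointly predict the next token and an early-exit signal from one
    hidden state, in the spirit of the paper's unified head. The 0.9
    threshold is an assumed hyperparameter for illustration."""
    token_logits = hidden_state @ W_tok                 # shape (VOCAB,)
    next_token = int(np.argmax(token_logits))           # greedy draft token
    exit_logit = float(hidden_state @ w_exit)
    exit_prob = 1.0 / (1.0 + np.exp(-exit_logit))       # sigmoid
    return next_token, exit_prob, exit_prob >= exit_threshold


# Example: one draft step on a random hidden state.
h = rng.normal(size=HIDDEN)
token, p_exit, should_exit = draft_step(h)
```

Because the exit decision reuses the hidden state already computed for drafting, it adds only one extra dot product per step, which is why this design avoids the separate probing pass that limits earlier early-exit methods.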
Rubing Yang
University of Pennsylvania
Deep learning · Machine perception
Huajun Bai
Tencent
Song Liu
Tencent
Guanghua Yu
Tencent
Runzhi Fan
Tencent
Yanbin Dang
Tencent
Jiejing Zhang
Tencent
Kai Liu
Tencent
Jianchen Zhu
Tencent
Peng Chen
Tencent