SpecExit: Accelerating Large Reasoning Model via Speculative Exit

📅 2025-09-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large reasoning models (LRMs) suffer from excessive computation, termed "overthinking," which produces verbose outputs, inflates end-to-end latency, and hinders practical deployment. To address this, the authors propose a probe-free early-exit mechanism: a lightweight draft model jointly predicts the next token and an exit signal directly from the main model's hidden states, enabling dynamic, real-time termination of inference. Crucially, the approach unifies token prediction and exit decision-making within a single lightweight head, eliminating the computational overhead and latency introduced by conventional probe-based exit methods. Evaluated across multiple reasoning benchmarks, the method reduces average generation length by 66% and improves end-to-end latency by 2.5x without compromising the original model's accuracy. This work establishes a general-purpose early-exit paradigm for efficient LRM inference, characterized by low computational overhead, no accuracy degradation, and seamless integration into standard autoregressive decoding.

📝 Abstract
Despite their strong performance on reasoning tasks, large reasoning models (LRMs) often suffer from overthinking, producing unnecessarily long outputs and incurring high end-to-end latency, a significant limitation to their real-world deployment. To address overthinking, early-exit mechanisms have been proposed to terminate reasoning before typical completion, showing that this approach can effectively shorten generation length with minimal impact on accuracy. However, their reliance on probing mechanisms introduces a detection overhead that limits their end-to-end latency gains and compromises their generalizability across diverse problems. Inspired by the use of hidden states in speculative decoding, we propose SpecExit, a novel framework that predicts both future tokens and an early-exit signal directly from a lightweight draft model without probing overhead. Our method offers significant improvements, reducing average generation length by 66% and achieving a 2.5x speedup in end-to-end latency compared to the speculative decoding baseline, without compromising accuracy. Our method leverages the inherent signals from hidden states to provide effective early-exit signals, suggesting broader use of hidden states for efficient reasoning. Our code is available at https://github.com/Tencent/AngelSlim.
Problem

Research questions and friction points this paper is trying to address.

Reduces overthinking in large reasoning models
Minimizes detection overhead in early-exit mechanisms
Accelerates reasoning without compromising model accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Predicts tokens and exit signals from hidden states
Uses lightweight draft model without probing overhead
Reduces generation length and latency while maintaining accuracy
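The core idea above, a single lightweight head that maps the main model's hidden state to both next-token logits and an exit probability, can be sketched as follows. This is a minimal illustrative mock in numpy, not the paper's implementation: the hidden size, vocabulary size, weight shapes, and exit threshold are all assumptions chosen for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN, VOCAB = 64, 100  # illustrative sizes, not the paper's

# Hypothetical weights of one lightweight head: a token-prediction
# projection plus a single extra column producing the exit logit.
W_tok = rng.normal(scale=0.02, size=(HIDDEN, VOCAB))
w_exit = rng.normal(scale=0.02, size=HIDDEN)


def draft_step(hidden_state, exit_threshold=0.9):
    """Jointly predict the next token and an early-exit signal from one
    hidden state, in the spirit of the paper's unified head. The 0.9
    threshold is an assumed hyperparameter for illustration."""
    token_logits = hidden_state @ W_tok                 # shape (VOCAB,)
    next_token = int(np.argmax(token_logits))           # greedy draft token
    exit_logit = float(hidden_state @ w_exit)
    exit_prob = 1.0 / (1.0 + np.exp(-exit_logit))       # sigmoid
    return next_token, exit_prob, exit_prob >= exit_threshold


# Example: one draft step on a random hidden state.
h = rng.normal(size=HIDDEN)
token, p_exit, should_exit = draft_step(h)
```

Because the exit decision reuses the hidden state already computed for drafting, it adds only one extra dot product per step, which is why this design avoids the separate probing pass that limits earlier early-exit methods.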
Rubing Yang
University of Pennsylvania
Deep learning · Machine perception
Huajun Bai
Tencent
Song Liu
Tencent
Guanghua Yu
Tencent
Runzhi Fan
Tencent
Yanbin Dang
Tencent
Jiejing Zhang
Tencent
Kai Liu
Tencent
Jianchen Zhu
Tencent
Peng Chen
Tencent