CAS-Spec: Cascade Adaptive Self-Speculative Decoding for On-the-Fly Lossless Inference Acceleration of LLMs

📅 2025-10-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing speculative decoding faces three key challenges: limited acceleration from immediate self-speculation; high training overhead for multi-level cascaded draft models; and suboptimal efficiency of conventional vertical/horizontal cascading in self-speculative inference. This paper proposes CAS-Spec, the first framework to integrate layer-wise sparsification and activation quantization into cascaded self-speculation. It introduces a Dynamic Tree-based Cascading (DyTC) algorithm that enables adaptive routing and multi-level draft-path optimization based on acceptance rate and latency prediction. Coupled with a Dynamic Switchable Inference Acceleration (DSIA) scheduling strategy, CAS-Spec achieves lossless speedups of 1.1×–2.3× across diverse large language models and datasets. Compared to cascaded and tree-based baselines, it improves average acceleration ratios by 47% and 48%, respectively.

📝 Abstract
Speculative decoding has become a widely adopted technique for lossless inference acceleration when deploying large language models (LLMs). While on-the-fly self-speculative methods offer seamless integration and broad utility, they often fall short of the speed gains achieved by methods relying on specialized training. Cascading a hierarchy of draft models promises further acceleration and flexibility, but the high cost of training multiple models has limited its practical application. In this paper, we propose a novel Cascade Adaptive Self-Speculative Decoding (CAS-Spec) method which constructs speculative draft models by leveraging dynamically switchable inference acceleration (DSIA) strategies, including layer sparsity and activation quantization. Furthermore, traditional vertical and horizontal cascade algorithms are inefficient when applied to self-speculative decoding methods. We introduce a Dynamic Tree Cascade (DyTC) algorithm that adaptively routes the multi-level draft models and assigns the draft lengths, based on heuristics of acceptance rates and latency prediction. Our CAS-Spec method achieves state-of-the-art acceleration compared to existing on-the-fly speculative decoding methods, with an average speedup from $1.1\times$ to $2.3\times$ over autoregressive decoding across various LLMs and datasets. DyTC improves the average speedup by 47% and 48% over cascade-based and tree-based baseline algorithms, respectively. CAS-Spec can be easily integrated into most existing LLMs and holds promising potential for further acceleration as self-speculative decoding techniques continue to evolve.
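The abstract's "heuristics of acceptance rates and latency prediction" can be illustrated with the standard speculative-decoding expectation: drafting $k$ tokens with per-token acceptance rate $\alpha$ yields $1 + \alpha + \dots + \alpha^k$ expected tokens per verification step. Below is a minimal sketch of how a scheduler could pick a draft length from that trade-off; the function names and the single-path formulation are illustrative assumptions, not the paper's actual DyTC algorithm (which routes over a tree of multi-level draft paths).

```python
# Hypothetical sketch: choosing a draft length from acceptance-rate and
# latency estimates. This is NOT the paper's DyTC implementation; it only
# shows the kind of expected-throughput heuristic the summary describes.

def expected_accepted(alpha: float, k: int) -> float:
    """Expected tokens produced per verification step when drafting k
    tokens with per-token acceptance rate alpha: 1 + alpha + ... + alpha^k
    (the standard speculative-decoding result)."""
    if alpha == 1.0:
        return k + 1.0
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

def best_draft_length(alpha: float, t_draft: float, t_verify: float,
                      k_max: int = 8) -> int:
    """Pick the draft length k maximizing expected tokens per unit time:
    expected_accepted(alpha, k) / (k * t_draft + t_verify)."""
    def rate(k: int) -> float:
        return expected_accepted(alpha, k) / (k * t_draft + t_verify)
    return max(range(1, k_max + 1), key=rate)
```

For example, with an acceptance rate of 0.8 and a draft pass 5x cheaper than verification (`best_draft_length(0.8, 0.2, 1.0)`), the optimum lands at a moderate draft length: longer drafts add latency faster than the decaying acceptance probability adds expected tokens.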
Problem

Research questions and friction points this paper is trying to address.

Accelerating LLM inference without accuracy loss
Reducing training costs for multi-model cascade systems
Optimizing draft model routing and length allocation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages dynamically switchable inference acceleration strategies
Introduces Dynamic Tree Cascade algorithm for adaptive routing
Constructs speculative draft models using layer sparsity and quantization
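The last bullet is the core self-speculative idea: cheaper draft models are carved out of the target model itself (e.g. by skipping layers), so no extra training is needed and the full model remains the verifier. A toy sketch under assumed names, purely illustrative of layer-sparsity drafting rather than the paper's DSIA strategies:

```python
# Illustrative sketch (not the paper's code): deriving a "free" draft
# model from the target by skipping layers at inference time.

class ToyLayer:
    """Stand-in for a transformer block: here just adds a constant."""
    def __init__(self, delta: float):
        self.delta = delta

    def __call__(self, x: float) -> float:
        return x + self.delta

class ToyModel:
    def __init__(self, layers):
        self.layers = layers

    def forward(self, x: float, skip=frozenset()) -> float:
        # The sparse (faster, approximate) draft pass omits the layer
        # indices in `skip`; the full pass (skip empty) is the verifier.
        for i, layer in enumerate(self.layers):
            if i not in skip:
                x = layer(x)
        return x
```

Because draft and verifier share one set of weights, a runtime scheduler can switch between several skip patterns (more or fewer layers, optionally combined with activation quantization) to form the multi-level cascade without training any separate draft model.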
Zhiyuan Ning
Westlake University
Graph Machine Learning · Knowledge Graphs · Large Language Models
Jiawei Shao
TeleAI, Shanghai Jiao Tong University
Ruge Xu
Shanghai Jiao Tong University
Xinfei Guo
Shanghai Jiao Tong University
VLSI · EDA · Reliability · Low Power · Microarchitecture
Jun Zhang
Hong Kong University of Science and Technology
Chi Zhang
TeleAI, Shanghai Jiao Tong University
Xuelong Li
TeleAI, Shanghai Jiao Tong University