SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration

📅 2024-10-09
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
To address the limitations of speculative decoding (SD), namely its reliance on auxiliary models, additional training overhead, and poor generalization, this paper proposes SWIFT, a self-speculative decoding method for accelerating large language model (LLM) inference that requires no auxiliary model, adds no extra parameters, and is entirely training-free. Leveraging inherent layer-wise sparsity within the target LLM, the approach dynamically skips redundant Transformer layers during drafting. A lightweight confidence prediction module, coupled with an autoregressive verification mechanism, enables input-adaptive, end-to-end speculative decoding. As the first plug-and-play SD paradigm based purely on layer skipping, SWIFT achieves a 1.3x-1.6x speedup across diverse models (e.g., Llama-2/3, Qwen, Phi-3) and tasks, while preserving the original generation distribution and requiring no fine-tuning or deployment modifications.

📝 Abstract
Speculative decoding (SD) has emerged as a widely used paradigm to accelerate LLM inference without compromising quality. It works by first employing a compact model to draft multiple tokens efficiently and then using the target LLM to verify them in parallel. While this technique has achieved notable speedups, most existing approaches necessitate either additional parameters or extensive training to construct effective draft models, thereby restricting their applicability across different LLMs and tasks. To address this limitation, we explore a novel plug-and-play SD solution with layer-skipping, which skips intermediate layers of the target LLM as the compact draft model. Our analysis reveals that LLMs exhibit great potential for self-acceleration through layer sparsity and the task-specific nature of this sparsity. Building on these insights, we introduce SWIFT, an on-the-fly self-speculative decoding algorithm that adaptively selects intermediate layers of LLMs to skip during inference. SWIFT does not require auxiliary models or additional training, making it a plug-and-play solution for accelerating LLM inference across diverse input data streams. Our extensive experiments across a wide range of models and downstream tasks demonstrate that SWIFT can achieve a 1.3x-1.6x speedup while preserving the original distribution of the generated text. We release our code at https://github.com/hemingkx/SWIFT.
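The draft-then-verify loop described in the abstract can be sketched with a toy model. This is a minimal illustration, not the paper's implementation: the "LLM" here is a stand-in stack of arithmetic "layers", the skipped-layer set is fixed rather than adaptively selected, and decoding is greedy. The names (`make_toy_llm`, `swift_style_decode`) are hypothetical.

```python
import random

def make_toy_llm(num_layers=8, vocab=16, seed=0):
    """Toy stand-in for an LLM: a stack of deterministic 'layers'.

    In SWIFT, the draft model is the target LLM itself with a subset of
    intermediate layers skipped; the full model verifies drafts in parallel.
    """
    rng = random.Random(seed)
    weights = [rng.randrange(1, vocab) for _ in range(num_layers)]

    def forward(token, skip=frozenset()):
        h = token
        for i, w in enumerate(weights):
            if i in skip:            # layer skipping: pass hidden state through
                continue
            h = (h * w + i) % vocab  # toy "layer" transformation
        return h                     # greedy next-token id

    return forward

def swift_style_decode(forward, prompt_token, num_tokens, skip, draft_len=4):
    """Self-speculative greedy decoding: cheap drafts, exact verification."""
    out = [prompt_token]
    while len(out) - 1 < num_tokens:
        # Draft: autoregressively propose tokens with layers skipped (cheap).
        draft, t = [], out[-1]
        for _ in range(draft_len):
            t = forward(t, skip=skip)
            draft.append(t)
        # Verify: the full model recomputes each position; accept the longest
        # matching prefix plus one corrected token, so the output is exactly
        # what full-model greedy decoding would produce.
        t = out[-1]
        for d in draft:
            true_next = forward(t)
            out.append(true_next)
            if true_next != d or len(out) - 1 >= num_tokens:
                break
            t = true_next
    return out[1:1 + num_tokens]
```

Because every accepted token is re-derived by the full model, the generated sequence is identical to plain full-model decoding; the speedup comes from drafts being cheap and frequently correct. SWIFT's actual contribution, choosing which layers to skip on the fly per input stream, is abstracted away here as the fixed `skip` set.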
Problem

Research questions and friction points this paper is trying to address.

Accelerates LLM inference without quality loss
Eliminates need for additional models or training
Adaptively skips layers for self-speculative decoding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Plug-and-play layer-skipping for LLM acceleration
Self-speculative decoding without auxiliary models
Adaptive layer selection for task-specific sparsity
Heming Xia
Natural Language Processing Group, The Hong Kong Polytechnic University
Natural Language Processing · Large Language Models
Yongqi Li
Department of Computing, The Hong Kong Polytechnic University
Jun Zhang
College of Computer Science and Technology, Zhejiang University
Cunxiao Du
Research Scientist at Sea AI Lab
NLP · LLM Inference
Wenjie Li
Department of Computing, The Hong Kong Polytechnic University