CLaSp: In-Context Layer Skip for Self-Speculative Decoding

📅 2025-05-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing speculative decoding methods typically require training additional draft models, which incurs high deployment costs and limits compatibility across LLMs. This work proposes CLaSp, a training-free, module-free self-speculative decoding approach that dynamically skips intermediate layers of the target (verify) model during inference, constructing a lightweight, plug-and-play draft model on the fly. The layer-skipping policy is optimized with a dynamic programming algorithm that uses the complete hidden states from the last verification stage as its objective, so the set of skipped layers adapts after every verification step rather than being fixed in advance. Evaluated on LLaMA3-series models, CLaSp achieves a 1.3x–1.7x end-to-end speedup while preserving the original generation distribution. The core contribution is a speculative decoding paradigm that is training-free, dynamically adaptive, and compatible with unmodified base models, eliminating the need for auxiliary modules or retraining.

📝 Abstract
Speculative decoding (SD) is a promising method for accelerating the decoding process of Large Language Models (LLMs). The efficiency of SD primarily hinges on the consistency between the draft model and the verify model. However, existing drafting approaches typically require additional modules to be trained, which can be challenging to implement and ensure compatibility across various LLMs. In this paper, we propose CLaSp, an in-context layer-skipping strategy for self-speculative decoding. Unlike prior methods, CLaSp does not require additional drafting modules or extra training. Instead, it employs a plug-and-play mechanism by skipping intermediate layers of the verify model to construct a compressed draft model. Specifically, we develop a dynamic programming algorithm that optimizes the layer-skipping process by leveraging the complete hidden states from the last verification stage as an objective. This enables CLaSp to dynamically adjust its layer-skipping strategy after each verification stage, without relying on pre-optimized sets of skipped layers. Experimental results across diverse downstream tasks demonstrate that CLaSp achieves a speedup of 1.3x ~ 1.7x on LLaMA3 series models without altering the original distribution of the generated text.
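The abstract describes a dynamic programming algorithm that picks which layers to skip, guided by hidden states from the last verification stage. The following is a minimal, hypothetical sketch (not the paper's implementation): it assumes each layer has already been assigned a scalar redundancy score (e.g., derived from how little that layer changed the hidden states during the last verification), and uses a DP table to select exactly `num_skip` layers maximizing total redundancy.

```python
# Hypothetical sketch: select layers to skip via dynamic programming.
# `redundancy[i]` is an assumed per-layer score from the last verification
# stage; higher means layer i contributed less to the hidden states.

def choose_skipped_layers(redundancy, num_skip):
    L = len(redundancy)
    NEG = float("-inf")
    # dp[i][j]: best total redundancy over the first i layers with j skipped
    dp = [[NEG] * (num_skip + 1) for _ in range(L + 1)]
    choice = [[False] * (num_skip + 1) for _ in range(L + 1)]
    dp[0][0] = 0.0
    for i in range(1, L + 1):
        for j in range(0, min(i, num_skip) + 1):
            keep = dp[i - 1][j]  # keep layer i-1 in the draft path
            skip = dp[i - 1][j - 1] + redundancy[i - 1] if j > 0 else NEG
            if skip > keep:
                dp[i][j], choice[i][j] = skip, True
            else:
                dp[i][j] = keep
    # Backtrack to recover which layers were skipped.
    skipped, j = [], num_skip
    for i in range(L, 0, -1):
        if choice[i][j]:
            skipped.append(i - 1)
            j -= 1
    return sorted(skipped)

scores = [0.1, 0.9, 0.8, 0.05, 0.7, 0.2]
print(choose_skipped_layers(scores, 3))  # → [1, 2, 4]
```

With a plain count budget this reduces to a top-k selection; the DP formulation is shown because it extends naturally to objectives with interactions between adjacent layers, which a greedy top-k cannot handle. The actual objective CLaSp optimizes is the full hidden-state reconstruction error, not a per-layer additive score.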
Problem

Research questions and friction points this paper is trying to address.

Accelerating LLM decoding without additional training modules
Ensuring draft-verify model consistency via layer-skipping strategy
Dynamic optimization of skipped layers for faster inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

In-context layer-skipping for self-speculative decoding
Plug-and-play mechanism without extra training
Dynamic programming optimizes layer-skipping strategy
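The self-speculative loop these points describe can be sketched as follows. This is a simplified, hypothetical illustration with greedy acceptance: `draft_step` stands in for the layer-skipped draft pass and `full_step` for the full verify model; both names and the toy models are assumptions, not the paper's code.

```python
# Toy sketch of a self-speculative decoding loop: draft gamma tokens with
# the cheap layer-skipped path, then verify with the full model and accept
# the longest matching prefix (greedy acceptance for simplicity).

def self_speculative_decode(full_step, draft_step, prompt, gamma=4, max_new=16):
    tokens = list(prompt)
    while len(tokens) < len(prompt) + max_new:
        # 1. Draft gamma tokens cheaply with the layer-skipped model.
        draft = []
        for _ in range(gamma):
            draft.append(draft_step(tokens + draft))
        # 2. Verify with the full model; accept until the first mismatch,
        #    substituting the full model's token at the mismatch position.
        accepted, correction = 0, None
        for i in range(gamma):
            target = full_step(tokens + draft[:i])
            if target == draft[i]:
                accepted += 1
            else:
                correction = target
                break
        tokens.extend(draft[:accepted])
        if correction is not None:
            tokens.append(correction)
    return tokens[len(prompt):len(prompt) + max_new]

# Toy "models" over integer tokens: the full model always emits last+1;
# the draft model is wrong whenever the last token is a multiple of 3.
full_step = lambda seq: seq[-1] + 1
draft_step = lambda seq: 0 if seq[-1] % 3 == 0 else seq[-1] + 1
print(self_speculative_decode(full_step, draft_step, [1], gamma=4, max_new=6))
# → [2, 3, 4, 5, 6, 7]
```

Because every emitted token is either confirmed or produced by the full model, the output matches what the full model alone would generate; this is the distribution-preservation property the abstract claims, shown here in its greedy special case. CLaSp additionally re-runs its DP after each verification stage to refresh the skip set.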