AI Summary
This work addresses the challenge of efficiently constructing draft models for speculative decoding to accelerate large language model inference while preserving generation quality. The authors propose ConfLayers, a training-free dynamic layer-skipping method that iteratively computes confidence scores of each layer's output and adaptively sets thresholds to selectively skip layers with low contribution. By optimizing the skipping strategy based on these confidence estimates, ConfLayers achieves a consistent trade-off between speed and output quality across diverse models and tasks. Experimental results demonstrate up to 1.4× inference speedup over the original model, significantly outperforming existing draft model construction approaches.
Abstract
Self-speculative decoding is an inference technique for large language models designed to speed up generation without sacrificing output quality. It combines fast, approximate decoding using a compact version of the model as a draft model with selective re-evaluation by the full target model. Some existing methods form the draft model by dynamically learning which layers to skip during inference, effectively creating a smaller subnetwork to speed up computation. However, using heuristic-based approaches to select layers to skip can often be simpler and more effective. In this paper, we propose ConfLayers, a dynamic plug-and-play approach to forming the draft model in self-speculative decoding via confidence-based intermediate layer skipping. The process iteratively computes confidence scores for all layers, selects layers to skip based on an adaptive threshold, evaluates the performance of the resulting set, and updates the best selection until no further improvement is achieved or a maximum number of iterations is reached. This framework avoids the overhead and complexity of training a layer skipping policy and can provide more consistent speed-quality trade-offs while preserving the adaptivity of the draft model to diverse tasks and datasets. The performance evaluation of ConfLayers across different models and datasets shows that our novel approach offers up to 1.4x speedup over vanilla LLM generation.
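The iterative selection loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not define how confidence scores are computed, how the threshold adapts, or what performance metric is used. Here `confidence_fn` and `evaluate_fn` are hypothetical callbacks, a lower score is assumed to mean "safer to skip", and the threshold is simply raised by a fixed `step` after each improvement as a placeholder for the adaptive rule.

```python
from __future__ import annotations

from typing import Callable, Sequence


def select_skip_layers(
    confidence_fn: Callable[[frozenset[int]], Sequence[float]],  # hypothetical
    evaluate_fn: Callable[[frozenset[int]], float],              # hypothetical
    num_layers: int,
    init_threshold: float,
    step: float = 0.1,
    max_iters: int = 10,
) -> tuple[frozenset[int], float]:
    """Return the best skip set found and its evaluation score."""
    best_skip: frozenset[int] = frozenset()   # start by skipping nothing
    best_score = evaluate_fn(best_skip)
    threshold = init_threshold

    for _ in range(max_iters):                # cap on the number of iterations
        # 1. Compute a confidence score for every layer.
        scores = confidence_fn(best_skip)
        # 2. Select layers whose score falls below the current threshold
        #    (assumption: low score means low contribution).
        candidate = frozenset(i for i in range(num_layers) if scores[i] < threshold)
        # 3. Evaluate the draft model that skips this set.
        score = evaluate_fn(candidate)
        # 4. Stop once there is no further improvement.
        if score <= best_score:
            break
        best_skip, best_score = candidate, score
        threshold += step                     # placeholder for the adaptive update

    return best_skip, best_score


# Toy usage with made-up numbers: six layers, fixed scores, and a proxy metric
# that rewards each skipped layer but heavily penalizes skipping layers 0, 2, 5.
def toy_confidence(_skipped: frozenset[int]) -> list[float]:
    return [0.9, 0.2, 0.8, 0.3, 0.1, 0.95]


def toy_evaluate(skipped: frozenset[int]) -> float:
    return len(skipped) - 5 * len(skipped & {0, 2, 5})


result = select_skip_layers(toy_confidence, toy_evaluate, num_layers=6, init_threshold=0.15)
print(sorted(result[0]), result[1])
```

In a real setting, `evaluate_fn` would presumably measure something like draft acceptance rate or end-to-end throughput on a small calibration set, and the chosen skip set would define the draft model whose proposals the full model then verifies.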