KNN-SSD: Enabling Dynamic Self-Speculative Decoding via Nearest Neighbor Layer Set Optimization

📅 2025-05-22
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Self-speculative decoding (SSD) accelerates LLM inference via layer-skipping to construct lightweight draft models, but its fixed skip-layer strategy suffers substantial performance degradation under domain shifts. To address this, we propose a KNN-driven dynamic domain-matching mechanism, the first to integrate parameter-free K-nearest-neighbor search into SSD. Our method dynamically retrieves optimal layer-skip configurations in real time based on input representations, enabling zero-training, zero-parameter cross-domain adaptation. Crucially, it requires no architectural modification to the backbone model or additional fine-tuning. Extensive experiments across multiple LLMs (Llama-2/3, Qwen) and diverse tasks (commonsense reasoning, code generation, mathematical QA) demonstrate consistent inference speedups of 1.3×–1.6×. Moreover, our approach significantly enhances SSD's robustness and generalization under distribution shift, establishing a new paradigm for adaptive speculative decoding.
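The retrieval step described above can be sketched as a small datastore that maps cached input representations to skip-layer sets, queried with a parameter-free K-nearest-neighbor vote. This is a hypothetical illustration under stated assumptions, not the paper's implementation: the class and method names are invented, and the real system would use the LLM's hidden states as representations.

```python
import math

class KNNSkipLayerSelector:
    """Illustrative sketch (names are not from the paper): retrieve the
    skip-layer configuration whose cached input representations are
    nearest to the current input."""

    def __init__(self, k=3):
        self.k = k
        self.entries = []  # list of (representation vector, skip-layer set)

    def add(self, representation, skip_layers):
        # Cache a representation together with the skip-layer set that
        # worked well for inputs from its domain.
        self.entries.append((list(representation), frozenset(skip_layers)))

    def select(self, representation):
        # Majority vote among the k nearest cached entries
        # (Euclidean distance; no learned parameters, no training).
        q = list(representation)
        ranked = sorted(self.entries, key=lambda e: math.dist(q, e[0]))[: self.k]
        votes = {}
        for _, layers in ranked:
            votes[layers] = votes.get(layers, 0) + 1
        return max(votes, key=votes.get)
```

Because the datastore holds only vectors and layer-index sets, adapting to a new domain reduces to caching a few representative inputs, which matches the zero-training, zero-parameter claim in the summary.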

๐Ÿ“ Abstract
Speculative Decoding (SD) has emerged as a widely used paradigm to accelerate the inference of large language models (LLMs) without compromising generation quality. It works by efficiently drafting multiple tokens using a compact model and then verifying them in parallel using the target LLM. Notably, Self-Speculative Decoding proposes skipping certain layers to construct the draft model, which eliminates the need for additional parameters or training. Despite its strengths, we observe in this work that drafting with layer skipping exhibits significant sensitivity to domain shifts, leading to a substantial drop in acceleration performance. To enhance the domain generalizability of this paradigm, we introduce KNN-SSD, an algorithm that leverages K-Nearest Neighbor (KNN) search to match different skipped layers with various domain inputs. We evaluated our algorithm on various models and multiple tasks, observing that its application leads to a 1.3×–1.6× speedup in LLM inference.
Problem

Research questions and friction points this paper is trying to address.

Improves domain generalizability of Self-Speculative Decoding
Optimizes layer skipping for dynamic domain adaptation
Accelerates LLM inference via KNN-based layer matching
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic layer skipping via KNN optimization
Domain-adaptive nearest neighbor search
Self-speculative decoding acceleration
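The draft-then-verify loop that these contributions build on can be sketched as a toy greedy version. This is an illustrative sketch, not the paper's method: the function names and the `n_draft` parameter are assumptions, the "draft model" stands in for the same LLM with layers skipped, and real speculative decoding verifies all drafted tokens in a single parallel forward pass rather than one at a time.

```python
def speculative_generate(draft_step, verify_step, prompt, n_draft=4, max_len=32):
    """Toy greedy self-speculative loop. draft_step and verify_step are
    callables mapping a token sequence to the next token: draft_step plays
    the cheap layer-skipped model, verify_step the full model."""
    seq = list(prompt)
    while len(seq) < max_len:
        # 1) Draft n_draft tokens cheaply with the layer-skipped model.
        drafted = []
        for _ in range(n_draft):
            drafted.append(draft_step(seq + drafted))
        # 2) Verify in order with the full model: accept the longest prefix
        #    the full model agrees with.
        accepted = 0
        for i, tok in enumerate(drafted):
            if verify_step(seq + drafted[:i]) == tok:
                accepted += 1
            else:
                break
        seq += drafted[:accepted]
        # 3) On a mismatch, take the full model's corrected token instead.
        if accepted < n_draft:
            seq.append(verify_step(seq))
    return seq[:max_len]
```

Verification guarantees the output matches what greedy decoding with the full model alone would produce; the speedup comes entirely from how many drafted tokens are accepted per round, which is exactly what a domain-mismatched skip-layer set degrades.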
Mingbo Song
National Key Laboratory for Multimedia Information Processing, Peking University
Heming Xia
Natural Language Processing Group, The Hong Kong Polytechnic University
Natural Language Processing · Large Language Models
Jun Zhang
College of Computer Science and Technology, Zhejiang University
Chak Tou Leong
Department of Computing, The Hong Kong Polytechnic University
Qiancheng Xu
Department of Computing, The Hong Kong Polytechnic University
Wenjie Li
Department of Computing, The Hong Kong Polytechnic University
Sujian Li
National Key Laboratory for Multimedia Information Processing, Peking University