AI Summary
This work addresses a key limitation of existing speculative decoding methods in long-context scenarios: static heuristics fail to adapt to the dynamic computational overhead of attention. We formalize draft model selection as a knapsack optimization problem and propose a framework that decouples attention and MLP layers, constructs a context-aware hardware latency model, and employs a parallel dynamic programming algorithm to select optimal draft configurations in real time, maximizing throughput. To ensure draft fidelity without retraining, we establish a theoretical guarantee based on cosine similarity and develop a training-free, adaptive layer selection mechanism. Evaluated on Qwen3 and Llama3, our approach achieves up to a 1.47× end-to-end speedup over state-of-the-art methods while preserving the target model's output distribution.
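To make the selection step concrete, here is a minimal, runnable sketch of the 0/1 knapsack view described above: sublayers are the items, their context-dependent latencies the weights, a fidelity score the value, and a per-step latency budget the capacity. Everything here (the `SubLayer` fields, the toy `attn_latency` cost model, the importance numbers) is a hypothetical illustration, not KnapSpec's actual profiler or algorithm.

```python
# Illustrative sketch, not the authors' code. All names, cost models,
# and numbers below are hypothetical.
from dataclasses import dataclass

@dataclass
class SubLayer:
    name: str          # e.g. "attn_17" or "mlp_17"
    latency_ms: float  # profiled cost at the current context length
    importance: float  # e.g. hidden-state cosine-similarity loss if skipped

def attn_latency(base_ms: float, ctx_len: int, ref_len: int = 1024) -> float:
    """Toy context-aware cost model: attention latency grows with context
    length, while MLP latency stays roughly constant. Real coefficients
    would be fitted per GPU."""
    return base_ms * ctx_len / ref_len

def select_draft(sublayers, budget_ms, grid=200):
    """0/1 knapsack: keep the most 'important' sublayers that fit within
    a per-step latency budget; all other sublayers are skipped in the
    draft model."""
    scale = grid / budget_ms
    best = [0.0] * (grid + 1)                # best importance at capacity c
    keep = [set() for _ in range(grid + 1)]  # chosen sublayer indices
    for i, sl in enumerate(sublayers):
        w = min(grid, round(sl.latency_ms * scale))
        for c in range(grid, w - 1, -1):     # descending: each item used once
            if best[c - w] + sl.importance > best[c]:
                best[c] = best[c - w] + sl.importance
                keep[c] = keep[c - w] | {i}
    return [sublayers[i] for i in sorted(keep[grid])]

# Example: re-solve as the context grows, so the draft configuration
# tracks the shifting attention cost.
layers = [SubLayer(f"attn_{l}", attn_latency(0.12, 8192), 0.9 - 0.02 * l)
          for l in range(8)]
layers += [SubLayer(f"mlp_{l}", 0.20, 0.5 + 0.03 * l) for l in range(8)]
print([sl.name for sl in select_draft(layers, budget_ms=3.0)])
```

Note that once the DP is kept as rows over capacities, the per-capacity updates within one item depend only on the previous row and are mutually independent, which is the natural hook for the kind of parallel dynamic programming the summary mentions; presumably the solve is also cheap enough to rerun whenever the context length changes.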
Abstract
Self-speculative decoding (SSD) accelerates LLM inference by skipping layers to form an efficient draft model, yet existing methods rely on static heuristics that ignore the dynamic computational overhead of attention in long-context scenarios. We propose KnapSpec, a training-free framework that reformulates draft model selection as a knapsack problem whose objective is token throughput (tokens generated per unit wall-clock time). By decoupling attention and MLP layers and modeling their hardware-specific latencies as functions of context length, KnapSpec adaptively identifies optimal draft configurations on the fly via a parallel dynamic programming algorithm. Furthermore, we provide the first rigorous theoretical analysis establishing cosine similarity between hidden states as a mathematically sound proxy for the token acceptance rate, which allows KnapSpec to maintain high draft fidelity while navigating the shifting bottlenecks of real-world hardware. Experiments on Qwen3 and Llama3 show that KnapSpec consistently outperforms state-of-the-art SSD baselines, achieving up to 1.47× wall-clock speedup across various benchmarks. Our plug-and-play approach delivers high-speed inference for long sequences without additional training or any change to the target model's output distribution.
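For a concrete sense of what a tokens-per-time objective over draft configurations can look like, the following LaTeX sketch combines the standard speculative-decoding expectation for accepted tokens per round with a context-dependent latency denominator. The symbols $\gamma$, $g$, $t_{\mathrm{attn}}$, $t_{\mathrm{mlp}}$, $h_S$, and the overall form are our illustrative assumptions; the paper's precise objective and its cosine-similarity result may differ.

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
% Under the standard speculative-decoding acceptance model (Leviathan et
% al., 2023), a round with draft length $\gamma$ and per-token acceptance
% rate $\alpha$ yields $(1-\alpha^{\gamma+1})/(1-\alpha)$ accepted tokens
% in expectation. A tokens-per-time objective over retained sublayer
% sets $S$ at context length $n$ could then read:
\[
  \max_{S \subseteq \mathcal{L}}\;
  \frac{1-\alpha(S)^{\gamma+1}}{1-\alpha(S)}
  \cdot
  \frac{1}{\gamma\, t_{\mathrm{draft}}(S,n) + t_{\mathrm{verify}}(n)},
  \qquad
  t_{\mathrm{draft}}(S,n)
  = \sum_{\ell \in S_{\mathrm{attn}}} t_{\mathrm{attn}}(n)
  + \sum_{\ell \in S_{\mathrm{mlp}}} t_{\mathrm{mlp}},
\]
% where $t_{\mathrm{attn}}(n)$ grows with $n$ while $t_{\mathrm{mlp}}$
% stays roughly constant, and $\alpha(S)$ is estimated via a monotone
% map $g$ (hypothetical) of the cosine similarity between draft and
% target hidden states:
\[
  \alpha(S) \;\approx\; g\bigl(\cos\angle(h_S,\, h)\bigr).
\]
\end{document}
```

The payoff of decoupling the two sublayer types is visible in the denominator: as $n$ grows, the $t_{\mathrm{attn}}(n)$ terms come to dominate, so the optimal $S$ drifts toward skipping attention sublayers, which is exactly the adaptivity that static, context-agnostic heuristics miss.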