AI Summary
This work addresses a key limitation of existing speculative decoding methods in long-context scenarios: static heuristics fail to adapt to the dynamic computational overhead of attention. We formalize draft model selection as a knapsack optimization problem and propose a framework that decouples attention and MLP layers, constructs a context-aware hardware latency model, and employs a parallel dynamic programming algorithm to select optimal draft configurations in real time, maximizing throughput. To ensure draft fidelity without retraining, we establish a theoretical guarantee based on cosine similarity and develop a training-free, adaptive layer selection mechanism. Evaluated on Qwen3 and Llama3, our approach achieves up to a 1.47× end-to-end speedup over state-of-the-art methods while preserving the target model's output distribution.
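To make the selection step concrete, here is a minimal, runnable sketch of the 0/1 knapsack view described above: sublayers are the items, their context-dependent latencies the weights, a fidelity score the value, and a per-step latency budget the capacity. Everything here (the `SubLayer` fields, the toy `attn_latency` cost model, the importance numbers) is a hypothetical illustration, not KnapSpec's actual profiler or algorithm.

```python
# Illustrative sketch, not the authors' code. All names, cost models,
# and numbers below are hypothetical.
from dataclasses import dataclass

@dataclass
class SubLayer:
    name: str          # e.g. "attn_17" or "mlp_17"
    latency_ms: float  # profiled cost at the current context length
    importance: float  # e.g. hidden-state cosine-similarity loss if skipped

def attn_latency(base_ms: float, ctx_len: int, ref_len: int = 1024) -> float:
    """Toy context-aware cost model: attention latency grows with context
    length, while MLP latency stays roughly constant. Real coefficients
    would be fitted per GPU."""
    return base_ms * ctx_len / ref_len

def select_draft(sublayers, budget_ms, grid=200):
    """0/1 knapsack: keep the most 'important' sublayers that fit within
    a per-step latency budget; all other sublayers are skipped in the
    draft model."""
    scale = grid / budget_ms
    best = [0.0] * (grid + 1)                # best importance at capacity c
    keep = [set() for _ in range(grid + 1)]  # chosen sublayer indices
    for i, sl in enumerate(sublayers):
        w = min(grid, round(sl.latency_ms * scale))
        for c in range(grid, w - 1, -1):     # descending: each item used once
            if best[c - w] + sl.importance > best[c]:
                best[c] = best[c - w] + sl.importance
                keep[c] = keep[c - w] | {i}
    return [sublayers[i] for i in sorted(keep[grid])]

# Example: re-solve as the context grows, so the draft configuration
# tracks the shifting attention cost.
layers = [SubLayer(f"attn_{l}", attn_latency(0.12, 8192), 0.9 - 0.02 * l)
          for l in range(8)]
layers += [SubLayer(f"mlp_{l}", 0.20, 0.5 + 0.03 * l) for l in range(8)]
print([sl.name for sl in select_draft(layers, budget_ms=3.0)])
```

Note that once the DP is kept as rows over capacities, the per-capacity updates within one item depend only on the previous row and are mutually independent, which is the natural hook for the kind of parallel dynamic programming the summary mentions; presumably the solve is also cheap enough to rerun whenever the context length changes.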
Abstract
Self-speculative decoding (SSD) accelerates LLM inference by skipping layers to form an efficient draft model, yet existing methods rely on static heuristics that ignore the dynamic computational overhead of attention in long-context scenarios. We propose KnapSpec, a training-free framework that reformulates draft model selection as a knapsack problem whose objective is token throughput (tokens generated per unit wall-clock time). By decoupling attention and MLP layers and modeling their hardware-specific latencies as functions of context length, KnapSpec adaptively identifies optimal draft configurations on the fly via a parallel dynamic programming algorithm. Furthermore, we provide the first rigorous theoretical analysis establishing cosine similarity between hidden states as a mathematically sound proxy for the token acceptance rate, which allows KnapSpec to maintain high draft fidelity while navigating the shifting bottlenecks of real-world hardware. Experiments on Qwen3 and Llama3 show that KnapSpec consistently outperforms state-of-the-art SSD baselines, achieving up to 1.47× wall-clock speedup across various benchmarks. Our plug-and-play approach delivers high-speed inference for long sequences without additional training or any change to the target model's output distribution.
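For a concrete sense of what a tokens-per-time objective over draft configurations can look like, the following LaTeX sketch combines the standard speculative-decoding expectation for accepted tokens per round with a context-dependent latency denominator. The symbols $\gamma$, $g$, $t_{\mathrm{attn}}$, $t_{\mathrm{mlp}}$, $h_S$, and the overall form are our illustrative assumptions; the paper's precise objective and its cosine-similarity result may differ.

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
% Under the standard speculative-decoding acceptance model (Leviathan et
% al., 2023), a round with draft length $\gamma$ and per-token acceptance
% rate $\alpha$ yields $(1-\alpha^{\gamma+1})/(1-\alpha)$ accepted tokens
% in expectation. A tokens-per-time objective over retained sublayer
% sets $S$ at context length $n$ could then read:
\[
  \max_{S \subseteq \mathcal{L}}\;
  \frac{1-\alpha(S)^{\gamma+1}}{1-\alpha(S)}
  \cdot
  \frac{1}{\gamma\, t_{\mathrm{draft}}(S,n) + t_{\mathrm{verify}}(n)},
  \qquad
  t_{\mathrm{draft}}(S,n)
  = \sum_{\ell \in S_{\mathrm{attn}}} t_{\mathrm{attn}}(n)
  + \sum_{\ell \in S_{\mathrm{mlp}}} t_{\mathrm{mlp}},
\]
% where $t_{\mathrm{attn}}(n)$ grows with $n$ while $t_{\mathrm{mlp}}$
% stays roughly constant, and $\alpha(S)$ is estimated via a monotone
% map $g$ (hypothetical) of the cosine similarity between draft and
% target hidden states:
\[
  \alpha(S) \;\approx\; g\bigl(\cos\angle(h_S,\, h)\bigr).
\]
\end{document}
```

The payoff of decoupling the two sublayer types is visible in the denominator: as $n$ grows, the $t_{\mathrm{attn}}(n)$ terms come to dominate, so the optimal $S$ drifts toward skipping attention sublayers, which is exactly the adaptivity that static, context-agnostic heuristics miss.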