🤖 AI Summary
To address the trade-off in speculative decoding between slow external draft models and self-speculative methods that require retraining, this work introduces Mamba, a linear-complexity state space model, as a lightweight and general-purpose external draft model. A test-time tree search algorithm dynamically generates high-quality candidate sequences without modifying the target language model, enabling plug-and-play cross-model deployment. Compared to existing external drafters, the approach achieves significant inference speedup, matches the performance of the best self-speculative methods, reduces memory overhead by over 30%, and adapts seamlessly across multiple large language models. Key innovations: (1) the first application of the Mamba architecture to speculative decoding; (2) a fine-tuning-free design ensuring broad compatibility; and (3) an efficient tree search mechanism with linear computational complexity.
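To make the tree-search idea concrete, here is a minimal, hypothetical sketch of tree-style drafting: instead of proposing a single draft sequence, the drafter expands the top-`branch` continuations at each depth, producing several scored candidate sequences for the target model to verify at once. The scoring and the `next_token_probs` interface are illustrative assumptions, not the paper's actual algorithm.

```python
# Hypothetical sketch of tree-structured drafting (not the paper's exact
# algorithm): expand the top-`branch` continuations at each depth, keeping
# cumulative log-probabilities so candidates can be ranked best-first.
import heapq
import math

def build_draft_tree(next_token_probs, root, depth=3, branch=2):
    """Expand a draft tree of candidate token sequences.

    next_token_probs(seq) -> {token: prob} is a toy stand-in for the
    drafter's next-token distribution. Returns (log-prob, sequence)
    pairs sorted best-first. Note: the frontier grows as branch**depth,
    so a real implementation would also prune the frontier each step.
    """
    frontier = [(0.0, list(root))]  # (cumulative log-prob, sequence)
    for _ in range(depth):
        expanded = []
        for logp, seq in frontier:
            probs = next_token_probs(seq)
            # Keep only the `branch` most probable continuations of this node.
            for tok, p in heapq.nlargest(branch, probs.items(),
                                         key=lambda kv: kv[1]):
                expanded.append((logp + math.log(p), seq + [tok]))
        frontier = expanded
    return sorted(frontier, key=lambda kv: -kv[0])

# Toy drafter distribution over two tokens (purely illustrative).
toy = lambda seq: {"a": 0.7, "b": 0.3}
cands = build_draft_tree(toy, root=[], depth=2, branch=2)
print(cands[0][1])  # highest-scoring candidate sequence -> ['a', 'a']
```

With a linear-complexity drafter such as Mamba, expanding such a tree stays cheap, since each node extension reuses the recurrent state rather than re-attending over the whole prefix.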
📝 Abstract
Speculative decoding has emerged as a promising approach to accelerating large language model (LLM) generation using a fast drafter while maintaining alignment with the target model's distribution. However, existing approaches face a trade-off: external drafters offer flexibility but can suffer from slower drafting, while self-speculation methods use drafters tailored to the target model but require re-training. In this paper, we introduce novel drafters based on Mamba, a state-of-the-art state space model (SSM), as a solution that combines the best aspects of both approaches. By leveraging the linear structure of SSMs, our approach avoids the quadratic complexity inherent in traditional Transformer-based methods, enabling faster drafting and lower memory usage while maintaining the flexibility to work across different target models. We further enhance efficiency with a novel test-time tree search algorithm for generating high-quality draft candidates. Our empirical evaluation demonstrates that Mamba-based drafters not only outperform existing external drafting methods but are also comparable to state-of-the-art self-speculation approaches, while using less memory and retaining cross-model adaptability.
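For readers unfamiliar with speculative decoding itself, the draft-and-verify loop the abstract refers to can be sketched as follows. This is a minimal greedy-decoding variant with toy integer-token models (`drafter`, `target` are illustrative stand-ins, not the paper's implementation); accepted tokens always match what the target model alone would have produced.

```python
# Minimal sketch of speculative decoding's draft-and-verify loop (greedy
# variant, toy models): the cheap drafter proposes k tokens; the target
# verifies them and keeps the longest prefix it agrees with, so the output
# is identical to plain greedy decoding with the target alone.

def greedy_speculative_step(draft_next, target_next, prefix, k=4):
    """Propose k draft tokens, then keep the prefix the target accepts.

    draft_next / target_next map a token sequence to the next token
    (greedy decoding). Returns the tokens accepted in this step.
    """
    # 1) Draft phase: the drafter proposes k tokens autoregressively.
    draft, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)

    # 2) Verify phase: the target checks each draft token; on the first
    #    disagreement, substitute the target's own token and stop.
    accepted, ctx = [], list(prefix)
    for t in draft:
        expected = target_next(ctx)
        if t == expected:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(expected)  # target's correction
            break
    else:
        # All k drafts accepted: append one bonus token from the target.
        accepted.append(target_next(ctx))
    return accepted

# Toy models over integer tokens: the target emits last token + 1; the
# drafter agrees except when the context length is a multiple of 3.
target = lambda ctx: ctx[-1] + 1
drafter = lambda ctx: ctx[-1] + (2 if len(ctx) % 3 == 0 else 1)

print(greedy_speculative_step(drafter, target, prefix=[0], k=4))  # -> [1, 2, 3]
```

The speedup comes from the verify phase: in a real system the target scores all k draft positions in one parallel forward pass, so each accepted draft token costs far less than a full target decoding step. The paper's contribution is making the draft phase itself fast and memory-light by using a Mamba drafter with tree-structured candidates.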