🤖 AI Summary
To address the trade-off in speculative decoding between slow external draft models and self-speculative methods that require retraining, this work introduces Mamba, a linear-complexity state space model, as a lightweight and general-purpose external draft model. A test-time tree search algorithm dynamically generates high-quality candidate sequences without modifying the target language model, enabling plug-and-play cross-model deployment. Compared to existing external drafters, the approach achieves significant inference speedup, matches the performance of the best self-speculative methods, reduces memory overhead by over 30%, and adapts seamlessly across multiple large language models. Key innovations: (1) the first application of the Mamba architecture to speculative decoding; (2) a fine-tuning-free design ensuring broad compatibility; and (3) an efficient tree search mechanism with linear computational complexity.
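To make the tree-search idea concrete, here is a minimal, hypothetical sketch of tree-style drafting: instead of proposing a single draft sequence, the drafter expands the top-`branch` continuations at each depth, producing several scored candidate sequences for the target model to verify at once. The scoring and the `next_token_probs` interface are illustrative assumptions, not the paper's actual algorithm.

```python
# Hypothetical sketch of tree-structured drafting (not the paper's exact
# algorithm): expand the top-`branch` continuations at each depth, keeping
# cumulative log-probabilities so candidates can be ranked best-first.
import heapq
import math

def build_draft_tree(next_token_probs, root, depth=3, branch=2):
    """Expand a draft tree of candidate token sequences.

    next_token_probs(seq) -> {token: prob} is a toy stand-in for the
    drafter's next-token distribution. Returns (log-prob, sequence)
    pairs sorted best-first. Note: the frontier grows as branch**depth,
    so a real implementation would also prune the frontier each step.
    """
    frontier = [(0.0, list(root))]  # (cumulative log-prob, sequence)
    for _ in range(depth):
        expanded = []
        for logp, seq in frontier:
            probs = next_token_probs(seq)
            # Keep only the `branch` most probable continuations of this node.
            for tok, p in heapq.nlargest(branch, probs.items(),
                                         key=lambda kv: kv[1]):
                expanded.append((logp + math.log(p), seq + [tok]))
        frontier = expanded
    return sorted(frontier, key=lambda kv: -kv[0])

# Toy drafter distribution over two tokens (purely illustrative).
toy = lambda seq: {"a": 0.7, "b": 0.3}
cands = build_draft_tree(toy, root=[], depth=2, branch=2)
print(cands[0][1])  # highest-scoring candidate sequence -> ['a', 'a']
```

With a linear-complexity drafter such as Mamba, expanding such a tree stays cheap, since each node extension reuses the recurrent state rather than re-attending over the whole prefix.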
📝 Abstract
Speculative decoding has emerged as a promising approach to accelerating large language model (LLM) generation using a fast drafter while maintaining alignment with the target model's distribution. However, existing approaches face a trade-off: external drafters offer flexibility but can suffer from slower drafting, while self-speculation methods use drafters tailored to the target model but require re-training. In this paper, we introduce novel drafters based on Mamba, a state-of-the-art state space model (SSM), as a solution that combines the best aspects of both approaches. By leveraging the linear structure of SSMs, our approach avoids the quadratic complexity inherent in traditional Transformer-based methods, enabling faster drafting and lower memory usage while maintaining the flexibility to work across different target models. We further enhance efficiency with a novel test-time tree search algorithm for generating high-quality draft candidates. Our empirical evaluation demonstrates that Mamba-based drafters not only outperform existing external drafting methods but are also comparable to state-of-the-art self-speculation approaches, while using less memory and retaining cross-model adaptability.
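For readers unfamiliar with speculative decoding itself, the draft-and-verify loop the abstract refers to can be sketched as follows. This is a minimal greedy-decoding variant with toy integer-token models (`drafter`, `target` are illustrative stand-ins, not the paper's implementation); accepted tokens always match what the target model alone would have produced.

```python
# Minimal sketch of speculative decoding's draft-and-verify loop (greedy
# variant, toy models): the cheap drafter proposes k tokens; the target
# verifies them and keeps the longest prefix it agrees with, so the output
# is identical to plain greedy decoding with the target alone.

def greedy_speculative_step(draft_next, target_next, prefix, k=4):
    """Propose k draft tokens, then keep the prefix the target accepts.

    draft_next / target_next map a token sequence to the next token
    (greedy decoding). Returns the tokens accepted in this step.
    """
    # 1) Draft phase: the drafter proposes k tokens autoregressively.
    draft, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)

    # 2) Verify phase: the target checks each draft token; on the first
    #    disagreement, substitute the target's own token and stop.
    accepted, ctx = [], list(prefix)
    for t in draft:
        expected = target_next(ctx)
        if t == expected:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(expected)  # target's correction
            break
    else:
        # All k drafts accepted: append one bonus token from the target.
        accepted.append(target_next(ctx))
    return accepted

# Toy models over integer tokens: the target emits last token + 1; the
# drafter agrees except when the context length is a multiple of 3.
target = lambda ctx: ctx[-1] + 1
drafter = lambda ctx: ctx[-1] + (2 if len(ctx) % 3 == 0 else 1)

print(greedy_speculative_step(drafter, target, prefix=[0], k=4))  # -> [1, 2, 3]
```

The speedup comes from the verify phase: in a real system the target scores all k draft positions in one parallel forward pass, so each accepted draft token costs far less than a full target decoding step. The paper's contribution is making the draft phase itself fast and memory-light by using a Mamba drafter with tree-structured candidates.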