STree: Speculative Tree Decoding for Hybrid State-Space Models

πŸ“… 2025-05-20
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the previously unexplored challenge of efficiently implementing tree-based speculative decoding in state-space models (SSMs) and SSM-Transformer hybrids. The authors propose the first scalable, SSM-native tree-based speculative decoding algorithm. Leveraging structural properties of the accumulated SSM state transition matrices, the method enables low-overhead token-tree generation and verification and supports a hardware-aware implementation for hybrid inference acceleration. Even with a baseline drafting model and tree structure, it outperforms vanilla SSM speculative decoding on three benchmarks, achieving significant throughput gains without compromising generation quality. The core contribution lies in overcoming the intrinsic SSM architectural constraints that hinder token-tree construction, thereby establishing the first systematic, scalable tree-based acceleration paradigm for SSM-family models in autoregressive inference.
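The structural property the summary alludes to can be illustrated with a toy linear SSM. Because the update h_t = A_t h_{t-1} + B_t x_t is linear, the state at any node of a drafted token tree depends only on its root-to-node path, so every branch can reuse its parent's state instead of re-processing the shared prefix. A minimal NumPy sketch (the function name, node encoding, and dense-matrix formulation are illustrative assumptions; the paper's method operates on structured transition matrices with a hardware-aware implementation):

```python
import numpy as np

def tree_ssm_states(h0, parents, A_seq, B_seq, x_seq):
    """Compute the hidden state of a linear SSM at every node of a token tree.

    The update h_t = A_t @ h_{t-1} + B_t @ x_t is linear, so a node's state
    depends only on its root-to-node path; each node extends its parent's
    state in one pass. Nodes must be in topological order (parent before
    child); parents[i] == -1 marks a child of the root (state h0).
    """
    states = []
    for i, p in enumerate(parents):
        h_prev = h0 if p == -1 else states[p]
        states.append(A_seq[i] @ h_prev + B_seq[i] @ x_seq[i])
    return states
```

Two sibling nodes here share their parent's state rather than recomputing the prefix, which is what makes batched tree drafting cheap relative to running each path separately.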

πŸ“ Abstract
Speculative decoding is a technique that leverages hardware concurrency to improve the efficiency of large-scale autoregressive (AR) Transformer models by enabling multiple steps of token generation in a single forward pass. State-space models (SSMs) are already more efficient than AR Transformers, since their state summarizes all past data with no need to cache or re-process tokens in a sliding-window context. However, their state can also comprise thousands of tokens, so speculative decoding has recently been extended to SSMs. Existing approaches, however, do not leverage tree-based verification methods, since current SSMs lack the means to compute a token tree efficiently. We propose the first scalable algorithm to perform tree-based speculative decoding in state-space models (SSMs) and hybrid architectures of SSMs and Transformer layers. We exploit the structure of accumulated state transition matrices to facilitate tree-based speculative decoding with minimal overhead to current SSM state update implementations. With this algorithm, we describe a hardware-aware implementation that improves on the naive application of AR Transformer tree-based speculative decoding methods to SSMs. Furthermore, we outperform vanilla speculative decoding with SSMs even with a baseline drafting model and tree structure on three different benchmarks, opening up opportunities for further speed-ups in SSM and hybrid-model inference. Code will be released upon paper acceptance.
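The verification half of tree-based speculative decoding is model-agnostic: the target model scores every drafted path in one forward pass, then the longest root-to-leaf path consistent with its predictions is accepted. A hedged greedy-acceptance sketch (the function name and dict-based tree encoding are assumptions for illustration, not the paper's API):

```python
def verify_token_tree(children, tokens, target_next):
    """Greedily accept the longest matching path in a drafted token tree.

    children: maps a node index (-1 = root) to its child node indices.
    tokens[i]: the drafted token at node i.
    target_next[i]: the target model's greedy prediction for the token
    following the path that ends at node i (target_next[-1] is its
    prediction after the shared prefix).

    Walk from the root, at each step accepting the child whose drafted
    token matches the target prediction; stop at the first mismatch.
    Returns the list of accepted node indices, in path order.
    """
    accepted, node = [], -1
    while True:
        match = next((c for c in children.get(node, [])
                      if tokens[c] == target_next[node]), None)
        if match is None:
            return accepted
        accepted.append(match)
        node = match
```

With sampling instead of greedy decoding, the matching test is replaced by a probabilistic accept/reject step, but the tree walk is the same; the win over vanilla (chain) speculative decoding is that several candidate continuations are verified per target forward pass.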
Problem

Research questions and friction points this paper is trying to address.

Enables tree-based speculative decoding for efficient SSMs and hybrid models
Improves SSM state update with minimal overhead using transition matrices
Outperforms vanilla speculative decoding in SSMs across multiple benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tree-based speculative decoding for SSMs
Hybrid SSM-Transformer architecture optimization
Hardware-aware state transition matrix utilization