SPIN: Accelerating Large Language Model Inference with Heterogeneous Speculative Models

📅 2025-03-20
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing speculative decoding methods employ homogeneous small models, limiting adaptability to varying request complexities, exhibiting poor batch-processing support, and lacking co-optimization between speculation and verification stages. To address these limitations, we propose a heterogeneous speculative decoding system. First, we introduce a novel dynamic selection mechanism for multiple heterogeneous speculative models, adaptively assigning the optimal small model based on per-request difficulty estimation. Second, we design a request decomposition scheduling strategy to reduce batch overhead during verification. Third, we develop a GPU-level speculation-verification pipelined execution framework to tightly couple the two stages. Experimental results demonstrate a 2.28× speedup in end-to-end inference latency over state-of-the-art approaches, achieving superior throughput and latency trade-offs. Our work establishes a new paradigm for efficient large language model inference through principled heterogeneity and system-level co-design.

πŸ“ Abstract
Speculative decoding has been shown as an effective way to accelerate Large Language Model (LLM) inference by using a Small Speculative Model (SSM) to generate candidate tokens in a so-called speculation phase, which are subsequently verified by the LLM in a verification phase. However, current state-of-the-art speculative decoding approaches have three key limitations: handling requests with varying difficulty using homogeneous SSMs, lack of robust support for batch processing, and insufficient holistic optimization for both speculation and verification phases. In this paper, we introduce SPIN, an efficient LLM inference serving system based on speculative decoding, designed to address these challenges through three main innovations. First, SPIN improves token speculation by using multiple heterogeneous SSMs, with a learning-based algorithm for SSM selection that operates without prior knowledge of request difficulty. Second, SPIN employs a request decomposition method to minimize batching overhead during LLM verification. Finally, SPIN orchestrates speculation and verification phases by pipelining their executions on GPUs to achieve further acceleration. Experimental results demonstrate that SPIN significantly outperforms state-of-the-art methods, achieving a performance increase of approximately 2.28X.
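The abstract does not specify how SPIN's learning-based SSM selection works internally; one plausible reading of "operates without prior knowledge of request difficulty" is a bandit-style policy that treats each heterogeneous SSM as an arm and uses the number of draft tokens the LLM accepts as reward. The sketch below is hypothetical: the class name `SSMSelector`, the epsilon-greedy policy, and the reward signal are all assumptions for illustration, not SPIN's published algorithm.

```python
import random

class SSMSelector:
    """Epsilon-greedy bandit over heterogeneous SSMs (illustrative sketch).

    Reward for an arm is the count of speculated tokens the LLM accepted,
    so SSMs whose drafts survive verification get picked more often.
    """

    def __init__(self, num_ssms, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = [0] * num_ssms     # times each SSM was chosen
        self.rewards = [0.0] * num_ssms  # cumulative accepted-token counts

    def select(self):
        # Explore with probability epsilon, or if some SSM is untried;
        # otherwise exploit the best average acceptance so far.
        if random.random() < self.epsilon or 0 in self.counts:
            return random.randrange(len(self.counts))
        return max(range(len(self.counts)),
                   key=lambda i: self.rewards[i] / self.counts[i])

    def update(self, ssm_id, accepted_tokens):
        # Feedback from verification: how many draft tokens were accepted.
        self.counts[ssm_id] += 1
        self.rewards[ssm_id] += accepted_tokens
```

With exploration disabled, the selector simply routes each new request to the SSM with the best acceptance history, which captures the intuition of matching model strength to request difficulty.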
Problem

Research questions and friction points this paper is trying to address.

Homogeneous SSMs cannot adapt to requests of varying difficulty
Batching introduces significant overhead during LLM verification
Speculation and verification phases are executed without holistic co-optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses multiple heterogeneous SSMs for token speculation
Implements request decomposition to reduce batching overhead
Pipelines speculation and verification phases on GPUs
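For readers unfamiliar with the speculation-verification loop these innovations build on, here is a minimal greedy-decoding sketch of one round. `draft_next` and `verify_next` are hypothetical stand-ins for the SSM and the LLM (each maps a token sequence to its next token), and the acceptance rule shown, keeping the longest agreeing prefix plus one LLM token, is the standard greedy variant of speculative decoding, not SPIN's pipelined implementation.

```python
def speculative_step(draft_next, verify_next, prefix, gamma=4):
    """One speculation-verification round (generic sketch).

    Returns the tokens committed this round: the accepted draft prefix
    plus one token from the verifier (a correction or a bonus token).
    """
    # Speculation phase: the SSM drafts gamma candidate tokens.
    draft = []
    ctx = list(prefix)
    for _ in range(gamma):
        tok = draft_next(ctx)
        draft.append(tok)
        ctx.append(tok)

    # Verification phase: the LLM checks each candidate in turn and
    # keeps the longest agreeing prefix.
    accepted = []
    ctx = list(prefix)
    for tok in draft:
        llm_tok = verify_next(ctx)
        if llm_tok != tok:
            accepted.append(llm_tok)  # LLM's correction replaces the draft
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(verify_next(ctx))  # bonus token after full acceptance
    return accepted
```

In a naive serving loop the two phases run back to back on the same GPU; SPIN's pipelining overlaps them across requests instead, which is where the additional acceleration comes from.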
Fahao Chen
the University of Aizu
Cloud computing, machine learning
Peng Li
School of Cyber Science and Engineering, Xi'an Jiaotong University, China
Tom H. Luan
School of Cyber Science and Engineering, Xi'an Jiaotong University, China
Zhou Su
Xi'an Jiaotong University
Jing Deng
Department of Computer Science, University of North Carolina at Greensboro, USA