SwiftSpec: Ultra-Low Latency LLM Decoding by Scaling Asynchronous Speculative Decoding

📅 2025-06-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
In single-query, long-output LLM scenarios, inference latency stays high because speculative decoding and tensor parallelism are hard to combine: the draft and target models have imbalanced compute requirements, their KV caches drift out of sync, and small-batch inter-GPU communication adds overhead. To address this, the paper proposes an asynchronous, decoupled speculative decoding architecture that fully separates draft generation from verification. It introduces a parallel tree decoding strategy, a tree-aware dynamic KV cache sharding mechanism, and low-latency fused CUDA kernels optimized for multi-GPU tensor parallelism. Evaluated across five model families and six datasets, the approach achieves an average 1.75× speedup; notably, it serves Llama3-70B at 348 tokens/s on eight Hopper GPUs, the lowest-latency result reported at this scale.
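For orientation, the draft-then-verify loop that SwiftSpec accelerates can be sketched as a minimal synchronous greedy variant (toy lambdas stand in for the draft and target models; this is an illustration of plain speculative decoding, not SwiftSpec's asynchronous tree-based design):

```python
def speculative_decode(draft, target, prompt, k=4, max_new=8):
    """Greedy speculative decoding: the cheap draft model proposes k
    tokens, the target model checks them, and the longest agreeing
    prefix is kept, so the output matches target-only decoding."""
    tokens = list(prompt)
    goal = len(tokens) + max_new
    while len(tokens) < goal:
        # Draft phase: propose k tokens autoregressively.
        proposal = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # Verify phase: in a real system the target scores all k
        # positions in a single batched forward pass.
        accepted = []
        for tok in proposal:
            expected = target(tokens + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # correct the first mismatch
                break
        else:
            accepted.append(target(tokens + accepted))  # bonus token
        tokens += accepted
    return tokens[:goal]

# Toy "models" over a digit vocabulary: the target counts upward;
# the draft agrees except when the last token is 3.
target = lambda seq: (seq[-1] + 1) % 10
draft = lambda seq: 0 if seq[-1] == 3 else (seq[-1] + 1) % 10
print(speculative_decode(draft, target, [1]))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Because a token is accepted only when it matches the target's own greedy prediction, the output is identical to running the target alone; the speedup comes from the target verifying several positions per forward pass.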

📝 Abstract
Low-latency decoding for large language models (LLMs) is crucial for applications like chatbots and code assistants, yet generating long outputs remains slow in single-query settings. Prior work on speculative decoding (which combines a small draft model with a larger target model) and on tensor parallelism has each accelerated decoding. However, conventional approaches fail to apply both simultaneously due to imbalanced compute requirements between draft and target models, KV-cache inconsistencies, and communication overheads under small-batch tensor parallelism. This paper introduces SwiftSpec, a system that targets ultra-low latency for LLM decoding. SwiftSpec redesigns the speculative decoding pipeline in an asynchronous and disaggregated manner, so that each component can be scaled flexibly and draft overhead is removed from the critical path. To realize this design, SwiftSpec proposes parallel tree generation, tree-aware KV cache management, and fused, latency-optimized kernels to overcome the challenges listed above. Across 5 model families and 6 datasets, SwiftSpec achieves an average of 1.75x speedup over state-of-the-art speculative decoding systems and, as a highlight, serves Llama3-70B at 348 tokens/s on 8 Nvidia Hopper GPUs, making it the fastest known system for low-latency LLM serving at this scale.
Problem

Research questions and friction points this paper is trying to address.

Achieving ultra-low latency in large language model decoding
Overcoming imbalanced compute in speculative decoding and tensor parallelism
Reducing communication overhead in small-batch tensor-parallelism scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Asynchronous and disaggregated speculative decoding pipeline
Parallel tree generation and tree-aware KV cache
Fused latency-optimized kernels for efficiency
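The first bullet — disaggregating draft generation from verification — can be illustrated with two workers exchanging work over queues. This is a deliberately simplified lock-step toy with hypothetical names (SwiftSpec additionally overlaps the two stages and speculates in trees rather than chains):

```python
import queue
import threading

def disaggregated_decode(draft, target, prompt, k=4, max_new=8):
    """Toy sketch: a draft worker proposes token chains on its own
    thread, while the main thread verifies them with the target model
    and sends back the authoritative context for the next round."""
    proposals = queue.Queue(maxsize=1)  # draft worker -> verifier
    contexts = queue.Queue(maxsize=1)   # verifier -> draft worker
    tokens = list(prompt)
    goal = len(tokens) + max_new

    def draft_worker():
        ctx = list(tokens)
        while ctx is not None:
            prop = []
            for _ in range(k):
                prop.append(draft(ctx + prop))
            proposals.put((ctx, prop))
            ctx = contexts.get()  # resync to the verified context

    worker = threading.Thread(target=draft_worker)
    worker.start()
    while len(tokens) < goal:
        ctx, prop = proposals.get()
        accepted = []
        for tok in prop:
            expected = target(ctx + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # correct the first mismatch
                break
        tokens = ctx + accepted
        # Send the verified context back, or None to stop the worker.
        contexts.put(list(tokens) if len(tokens) < goal else None)
    worker.join()
    return tokens[:goal]

target = lambda seq: (seq[-1] + 1) % 10
draft = lambda seq: 0 if seq[-1] == 3 else (seq[-1] + 1) % 10
print(disaggregated_decode(draft, target, [1]))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Once the two roles live in separate workers, each side can be given its own GPUs and batch schedule, which is the property the paper exploits to keep draft cost off the critical path.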