Nightjar: Dynamic Adaptive Speculative Decoding for Large Language Models Serving

📅 2025-12-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the performance degradation of speculative decoding (SD) in large language model (LLM) serving under high load—caused by excessive verification overhead from fixed speculation lengths—this paper proposes the first learning-based, dynamically adaptive SD mechanism. The approach employs a lightweight, load-aware model to predict the optimal speculation length in real time, coupled with an online control algorithm that dynamically enables or disables speculation per batch based on batch size and system load, enabling fine-grained runtime scheduling. Compared to standard SD, the method achieves up to 14.8% higher throughput and reduces end-to-end latency by 20.2%, significantly improving service robustness and resource efficiency in high-concurrency scenarios.

📝 Abstract
Speculative decoding (SD) accelerates LLM inference by verifying draft tokens in parallel. However, this method presents a critical trade-off: it improves throughput in low-load, memory-bound systems but degrades performance in high-load, compute-bound environments due to verification overhead. Current SD implementations use a fixed speculative length, failing to adapt to dynamic request rates and creating a significant performance bottleneck in real-world serving scenarios. To overcome this, we propose Nightjar, a novel learning-based algorithm for adaptive speculative inference that adjusts to request load by dynamically selecting the optimal speculative length for different batch sizes and even disabling speculative decoding when it provides no benefit. Experiments show that Nightjar achieves up to 14.8% higher throughput and 20.2% lower latency compared to standard speculative decoding, demonstrating robust efficiency for real-time serving.
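The core idea in the abstract — choose a speculative length per batch from the current load, and fall back to plain autoregressive decoding when the system is compute-bound — can be sketched as a simple controller. This is an illustrative heuristic stand-in for the paper's learned predictor, not Nightjar's actual implementation; the function names, the `max_spec_len` bound, and the `disable_batch` load threshold are all assumptions for the sketch.

```python
# Hedged sketch of a load-aware speculation-length controller in the
# spirit of Nightjar (NOT the paper's implementation): pick a draft
# length per batch, and disable speculation when verification overhead
# would outweigh the speedup (compute-bound regime).

from dataclasses import dataclass


@dataclass
class SpecConfig:
    max_spec_len: int = 8      # assumed upper bound on draft length
    disable_batch: int = 32    # assumed load threshold: beyond this,
                               # the system is treated as compute-bound


def choose_spec_len(batch_size: int, accept_rate: float,
                    cfg: SpecConfig = SpecConfig()) -> int:
    """Return a speculative length for the next batch (0 = speculation off).

    Heuristic: longer drafts when acceptance is high and load is low;
    none at all once the batch is large enough to be compute-bound.
    """
    if batch_size >= cfg.disable_batch:
        return 0  # compute-bound: plain autoregressive decoding
    # Scale the draft length down as the batch grows and as drafts
    # start getting rejected by the target model.
    load_factor = 1.0 - batch_size / cfg.disable_batch
    k = round(cfg.max_spec_len * accept_rate * load_factor)
    return max(1, min(k, cfg.max_spec_len))


if __name__ == "__main__":
    print(choose_spec_len(batch_size=2, accept_rate=0.9))   # light load: long drafts
    print(choose_spec_len(batch_size=48, accept_rate=0.9))  # heavy load: disabled (0)
```

The paper replaces this hand-tuned rule with a learned, load-aware model and an online control loop, but the interface — batch state in, speculative length (possibly zero) out — is the same shape.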
Problem

Research questions and friction points this paper is trying to address.

Adapts speculative decoding to dynamic request loads
Optimizes speculative length for varying batch sizes
Reduces performance overhead in compute-bound environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic adaptive speculative decoding algorithm
Learning-based adjustment of speculative length
Selective disabling for optimal performance
Rui Li
State Key Laboratory of Complex & Critical Software Environment, National Key Laboratory of Parallel and Distributed Computing, College of Computer Science and Technology, National University of Defense Technology, Changsha, China
Zhaoning Zhang
National University of Defense Technology
MLSys · Computer Vision · Distributed Computing
Libo Zhang
State Key Laboratory of Complex & Critical Software Environment, National Key Laboratory of Parallel and Distributed Computing, College of Computer Science and Technology, National University of Defense Technology, Changsha, China
Huaimin Wang
State Key Laboratory of Complex & Critical Software Environment, National Key Laboratory of Parallel and Distributed Computing, College of Computer Science and Technology, National University of Defense Technology, Changsha, China
Xiang Fu
State Key Laboratory of Complex & Critical Software Environment, National Key Laboratory of Parallel and Distributed Computing, College of Computer Science and Technology, National University of Defense Technology, Changsha, China
Zhiquan Lai
State Key Laboratory of Complex & Critical Software Environment, National Key Laboratory of Parallel and Distributed Computing, College of Computer Science and Technology, National University of Defense Technology, Changsha, China