🤖 AI Summary
Speculative decoding (SD) in large language model (LLM) serving degrades under high load because fixed speculation lengths incur excessive verification overhead. To address this, the paper proposes the first learning-based, dynamically adaptive SD mechanism. Our approach employs a lightweight, load-aware model that predicts the optimal speculation length in real time, coupled with an online control algorithm that enables or disables speculation per batch based on batch size and system load, allowing fine-grained runtime scheduling. Compared to standard SD, our method achieves up to 14.8% higher throughput and up to 20.2% lower end-to-end latency, significantly improving service robustness and resource efficiency in high-concurrency scenarios.
📝 Abstract
Speculative decoding (SD) accelerates LLM inference by verifying draft tokens in parallel. However, this method presents a critical trade-off: it improves throughput in low-load, memory-bound systems but degrades performance in high-load, compute-bound environments due to verification overhead. Current SD implementations use a fixed speculative length, failing to adapt to dynamic request rates and creating a significant performance bottleneck in real-world serving scenarios. To overcome this, we propose Nightjar, a novel learning-based algorithm for adaptive speculative inference that adjusts to request load by dynamically selecting the optimal speculative length for different batch sizes and even disabling speculative decoding when it provides no benefit. Experiments show that Nightjar achieves up to 14.8% higher throughput and 20.2% lower latency compared to standard speculative decoding, demonstrating robust efficiency for real-time serving.
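The control policy described above — shortening the draft as load rises and disabling speculation entirely when the system becomes compute-bound — can be sketched as a simple rule-based controller. This is a minimal illustrative sketch, not Nightjar's actual learned predictor; the thresholds, the `load` feature, and the linear length rule are all assumptions standing in for the paper's lightweight model.

```python
def choose_speculation_length(batch_size: int, load: float, max_len: int = 8) -> int:
    """Pick a speculative draft length for the next batch; 0 disables SD.

    `load` is a utilization estimate in [0, 1]. Under high load the system
    is compute-bound, so verification overhead outweighs the benefit of
    drafting and speculation is turned off. (Thresholds are illustrative.)
    """
    if load > 0.9 or batch_size > 64:  # compute-bound regime: SD hurts
        return 0                       # fall back to plain autoregressive decoding
    # Memory-bound regime: draft longer when idle, shorter as load rises.
    # A learned, load-aware model would replace this hand-written rule.
    length = int(max_len * (1.0 - load))
    return max(1, min(max_len, length))
```

A serving loop would call this once per batch, so the speculation length tracks the request rate instead of staying fixed.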