🤖 AI Summary
Speculative decoding (SD) in large language model (LLM) serving degrades under high load because fixed speculation lengths incur excessive verification overhead. To address this, the paper proposes the first learning-based, dynamically adaptive SD mechanism. Our approach employs a lightweight, load-aware model that predicts the optimal speculation length in real time, coupled with an online control algorithm that enables or disables speculation per batch based on batch size and system load, allowing fine-grained runtime scheduling. Compared to standard SD, our method achieves up to 14.8% higher throughput and up to 20.2% lower end-to-end latency, significantly improving service robustness and resource efficiency in high-concurrency scenarios.
📝 Abstract
Speculative decoding (SD) accelerates LLM inference by verifying draft tokens in parallel. However, this method presents a critical trade-off: it improves throughput in low-load, memory-bound systems but degrades performance in high-load, compute-bound environments due to verification overhead. Current SD implementations use a fixed speculative length, failing to adapt to dynamic request rates and creating a significant performance bottleneck in real-world serving scenarios. To overcome this, we propose Nightjar, a novel learning-based algorithm for adaptive speculative inference that adjusts to request load by dynamically selecting the optimal speculative length for different batch sizes and even disabling speculative decoding when it provides no benefit. Experiments show that Nightjar achieves up to 14.8% higher throughput and 20.2% lower latency compared to standard speculative decoding, demonstrating robust efficiency for real-time serving.
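The control policy described above — shortening the draft as load rises and disabling speculation entirely when the system becomes compute-bound — can be sketched as a simple rule-based controller. This is a minimal illustrative sketch, not Nightjar's actual learned predictor; the thresholds, the `load` feature, and the linear length rule are all assumptions standing in for the paper's lightweight model.

```python
def choose_speculation_length(batch_size: int, load: float, max_len: int = 8) -> int:
    """Pick a speculative draft length for the next batch; 0 disables SD.

    `load` is a utilization estimate in [0, 1]. Under high load the system
    is compute-bound, so verification overhead outweighs the benefit of
    drafting and speculation is turned off. (Thresholds are illustrative.)
    """
    if load > 0.9 or batch_size > 64:  # compute-bound regime: SD hurts
        return 0                       # fall back to plain autoregressive decoding
    # Memory-bound regime: draft longer when idle, shorter as load rises.
    # A learned, load-aware model would replace this hand-written rule.
    length = int(max_len * (1.0 - load))
    return max(1, min(max_len, length))
```

A serving loop would call this once per batch, so the speculation length tracks the request rate instead of staying fixed.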