Efficient Speculative Decoding for Llama at Scale: Challenges and Solutions

📅 2025-08-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
In large-scale production environments, Llama model inference suffers from high latency, and efficiently co-executing tree attention and multi-round speculative decoding on GPUs remains challenging. Method: This paper presents an industrial-grade EAGLE-based speculative decoding system that jointly optimizes tree attention, GPU kernel-level parallel scheduling, multi-round dynamic speculation strategies, and training-inference co-design. Contribution/Results: The system significantly improves hardware utilization and throughput stability. On 8 NVIDIA H100 GPUs, Llama4 Maverick decodes at roughly 4 ms per token at batch size one, 10% faster than the previously best known method; at large batch sizes, end-to-end inference speeds up by 1.4–2.0× over prior state-of-the-art methods. This work marks the first deployment of EAGLE at production scale—spanning thousands of GPUs—for LLM serving, establishing a scalable engineering paradigm for high-throughput, low-latency large-model inference.
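To make the draft-then-verify idea in the summary concrete, here is a minimal sketch of one speculative-decoding round. The paper's EAGLE system drafts with a lightweight head and verifies an entire token tree in a single target-model pass; this sketch shows only the simpler linear case, and `draft_model`/`target_model` are toy stand-in functions, not the actual system.

```python
# One round of draft-then-verify speculative decoding (conceptual sketch).
# Toy "models": deterministic functions over integer token ids, standing in
# for a cheap drafter and an expensive target LLM.

def draft_model(context, k):
    """Hypothetical cheap drafter: propose k candidate next tokens."""
    return [(context[-1] + i + 1) % 100 for i in range(k)]

def target_model(context):
    """Hypothetical target model: the (greedy) token it emits after context."""
    return (sum(context) * 31 + 7) % 100

def speculative_round(context, k=4):
    """Draft k tokens, verify against the target, and accept the longest
    agreeing prefix plus one corrected token -- so a single target "pass"
    always yields at least one token and at most k + 1 tokens."""
    draft = draft_model(context, k)
    accepted = []
    for tok in draft:
        expected = target_model(context + accepted)
        if tok == expected:
            accepted.append(tok)       # draft matched: accept, keep going
        else:
            accepted.append(expected)  # mismatch: take target's token, stop
            break
    else:
        # All k drafts matched: the verify pass also gives one bonus token.
        accepted.append(target_model(context + accepted))
    return accepted

tokens = speculative_round([1, 2, 3])
print(len(tokens))  # between 1 and k + 1 tokens per target pass
```

With greedy verification the accepted sequence is identical to what sequential target-model decoding would produce, which is why speculative decoding is lossless; the speedup comes purely from verifying several positions per target pass.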

📝 Abstract
Speculative decoding is a standard method for accelerating the inference speed of large language models. However, scaling it for production environments poses several engineering challenges, including efficiently implementing different operations (e.g., tree attention and multi-round speculative decoding) on GPU. In this paper, we detail the training and inference optimization techniques that we have implemented to enable EAGLE-based speculative decoding at a production scale for Llama models. With these changes, we achieve a new state-of-the-art inference latency for Llama models. For example, Llama4 Maverick decodes at a speed of about 4 ms per token (with a batch size of one) on 8 NVIDIA H100 GPUs, which is 10% faster than the previously best known method. Furthermore, for EAGLE-based speculative decoding, our optimizations enable us to achieve a speed-up for large batch sizes between 1.4x and 2.0x at production scale.
Problem

Research questions and friction points this paper is trying to address.

Accelerating Llama model inference with speculative decoding
Implementing GPU-efficient operations for production scaling
Optimizing training and inference for state-of-the-art latency
Innovation

Methods, ideas, or system contributions that make the work stand out.

EAGLE-based speculative decoding at production scale
Optimized tree attention and multi-round operations on GPU
Achieved state-of-the-art inference latency for Llama models
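The tree attention mentioned above lets a whole tree of draft continuations be verified in one target-model forward pass: each draft token may attend only to its ancestors in the tree, which is enforced with an attention mask. The sketch below builds such an ancestor mask from a parent list; the tree shape and mask layout are illustrative only, not the paper's actual GPU kernel.

```python
# Build a tree-attention mask from a parent-pointer representation of the
# draft tree (node 0 is the root; parents[0] == -1). M[i][j] is True iff
# node i may attend to node j, i.e. j is i itself or an ancestor of i.

def tree_attention_mask(parents):
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:          # walk up to the root, marking ancestors
            mask[i][j] = True
            j = parents[j]
    return mask

# Two draft branches off the root: 0 -> 1 -> 2 and 0 -> 3.
mask = tree_attention_mask([-1, 0, 1, 0])
for row in mask:
    print([int(v) for v in row])
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 0, 0, 1]
```

Because siblings are masked off from each other, both branches are scored in the same batchwise forward pass without interfering, and the branch that best matches the target model can be accepted.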