FastEagle: Cascaded Drafting for Accelerating Speculative Decoding

📅 2025-09-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing speculative decoding methods (e.g., EAGLE) rely on *N* sequential forward passes to generate *N* candidate tokens, limiting inference acceleration. This work proposes FastEagle, a non-autoregressive cascaded drafter that generates multi-layer draft tokens in parallel in a single forward pass, eliminating sequential dependencies during drafting. Key contributions include: (1) a lightweight hierarchical cascaded architecture; (2) inter-layer supervised training to improve draft quality; and (3) structural constraints on the draft tree that guarantee lossless verification. FastEagle is compatible with both greedy and stochastic decoding. Evaluated across multiple large language models and tasks, it outperforms EAGLE-3 at comparable acceptance rates and reduces end-to-end latency by up to 42%, significantly improving the efficiency and practicality of speculative decoding.

📝 Abstract
Speculative decoding accelerates generation by drafting candidates and verifying them in parallel, yet state-of-the-art drafters (e.g., EAGLE) still require N sequential passes to propose N tokens. We present FastEagle, a non-autoregressive cascaded drafter that emits an entire draft in a single forward pass. FastEagle replaces temporal steps with a lightweight layer cascade and trains with layer-wise supervision to mitigate error accumulation. Coupled with a constrained draft tree that preserves lossless verification cost, FastEagle delivers substantial wall-clock speedups over strong autoregressive drafters while maintaining competitive acceptance behavior. Across multiple LLMs (Vicuna-13B, LLaMA-Instruct 3.x, and DeepSeek-R1-Distill-LLaMA) and tasks (MT-Bench, HumanEval, GSM8K, CNN/DM, Alpaca), FastEagle consistently outperforms EAGLE-3 in speedup under both greedy and stochastic decoding, with comparable average acceptance lengths. These results indicate that removing sequential dependencies in drafting is a practical path toward lossless LLM inference acceleration.
Problem

Research questions and friction points this paper is trying to address.

Accelerating speculative decoding by removing sequential drafting passes
Proposing entire token drafts in single forward pass instead of N steps
Maintaining competitive acceptance rates while achieving lossless speedup
Innovation

Methods, ideas, or system contributions that make the work stand out.

Non-autoregressive cascaded drafter for single-pass drafting
Layer cascade with supervision to reduce error accumulation
Constrained draft tree for maintaining lossless verification
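The draft-and-verify loop behind these ideas can be illustrated with a toy sketch. This is not the paper's implementation: `target_next` and `draft_single_pass` are hypothetical stand-ins for the target LLM's greedy decoder and the non-autoregressive drafter, and the draft tree is simplified to a single chain. The point it demonstrates is the lossless-verification invariant: however wrong the drafter is, the verified output is token-for-token identical to pure greedy decoding by the target model.

```python
import random

random.seed(0)
VOCAB = list(range(8))

def target_next(ctx):
    """Stand-in for the target LLM's greedy next-token choice (toy model)."""
    return (sum(ctx) + len(ctx)) % len(VOCAB)

def draft_single_pass(ctx, n):
    """Stand-in for a non-autoregressive drafter: proposes n tokens at once.
    It guesses the target's choice most of the time, mimicking a trained drafter."""
    out, c = [], list(ctx)
    for _ in range(n):
        tok = target_next(c) if random.random() < 0.8 else random.choice(VOCAB)
        out.append(tok)
        c.append(tok)
    return out

def speculative_step(ctx, n=4):
    """One draft-and-verify round under greedy decoding.

    Accept the longest draft prefix that matches the target's greedy
    choices, then append one token from the target itself (a correction
    on mismatch, or a bonus token when the whole draft is accepted), so
    the output equals pure target-model greedy decoding."""
    draft = draft_single_pass(ctx, n)
    accepted, c = [], list(ctx)
    for tok in draft:
        t = target_next(c)
        if tok != t:
            accepted.append(t)  # correction token from the target model
            return accepted
        accepted.append(tok)
        c.append(tok)
    accepted.append(target_next(c))  # bonus token: all drafts accepted
    return accepted
```

A single `speculative_step` call verifies up to `n` drafted tokens against one round of target-model checks, which is where the wall-clock savings come from; in the real method the drafter's single forward pass replaces the N sequential drafter passes that EAGLE-style methods require.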
Haiduo Huang
Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University
Jiangcheng Song
Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University
Wenzhe Zhao
Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University
Pengju Ren
Professor, Xi'an Jiaotong University