🤖 AI Summary
Existing speculative decoding methods (e.g., EAGLE) rely on *N* sequential forward passes to generate *N* candidate tokens, limiting inference acceleration. This work proposes FastEagle, a non-autoregressive cascaded drafter that generates multi-layer draft tokens in parallel via a single forward pass, fully eliminating sequential dependencies during drafting. Key contributions include: (1) a lightweight hierarchical cascaded architecture; (2) layer-wise supervised training to improve draft quality and mitigate error accumulation; and (3) structural constraints on the draft tree to guarantee lossless verification. FastEagle is compatible with both greedy and stochastic decoding. Evaluated across multiple large language models and tasks, it outperforms EAGLE-3 in speedup with comparable acceptance lengths and achieves up to a 42% reduction in end-to-end latency, significantly improving the efficiency and practicality of speculative decoding.
📝 Abstract
Speculative decoding accelerates generation by drafting candidates and verifying them in parallel, yet state-of-the-art drafters (e.g., EAGLE) still require N sequential passes to propose N tokens. We present FastEagle, a non-autoregressive cascaded drafter that emits an entire draft in a single forward pass. FastEagle replaces temporal steps with a lightweight layer cascade and trains with layer-wise supervision to mitigate error accumulation. Coupled with a constrained draft tree that preserves lossless verification cost, FastEagle delivers substantial wall-clock speedups over strong autoregressive drafters while maintaining competitive acceptance behavior. Across multiple LLMs (Vicuna-13B, LLaMA-Instruct 3.x, and DeepSeek-R1-Distill-LLaMA) and tasks (MT-Bench, HumanEval, GSM8K, CNN/DM, Alpaca), FastEagle consistently outperforms EAGLE-3 in speedup under both greedy and stochastic decoding, with comparable average acceptance lengths. These results indicate that removing sequential dependencies in drafting is a practical path toward lossless LLM inference acceleration.
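To make the draft-then-verify loop concrete, below is a minimal sketch of lossless greedy speculative decoding: a drafter proposes several tokens at once (as FastEagle does in a single forward pass), and the target model accepts the longest prefix that matches its own greedy choices, then appends one corrected token. The `target_next` and `draft_tokens` functions are toy stand-ins, not the paper's models or API.

```python
# Minimal sketch of greedy speculative decoding verification.
# Not the authors' implementation: target_next and draft_tokens are
# toy deterministic stand-ins for the target LLM and the drafter.

def target_next(prefix):
    # Toy "target model": a deterministic next-token rule.
    return (sum(prefix) + 1) % 7

def draft_tokens(prefix, k=4):
    # Toy non-autoregressive drafter: proposes k tokens for one "pass".
    # It mimics draft quality degrading at deeper positions.
    out, p = [], list(prefix)
    for i in range(k):
        guess = target_next(p) if i < 2 else 0  # wrong after 2 tokens
        out.append(guess)
        p.append(guess)
    return out

def verify(prefix, draft):
    # Lossless verification: accept the longest prefix of the draft
    # matching the target's greedy choices, then add one corrected
    # token from the target, so every step yields >= 1 token.
    accepted, p = [], list(prefix)
    for t in draft:
        if t != target_next(p):
            break
        accepted.append(t)
        p.append(t)
    accepted.append(target_next(p))
    return accepted

print(verify([1, 2], draft_tokens([1, 2])))  # -> [4, 1, 2]
```

Because verification only checks draft tokens against the target model's own distribution, the accepted output is identical to what plain autoregressive decoding would produce; the speedup comes entirely from how many draft tokens survive verification per pass.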