🤖 AI Summary
Existing speculative decoding methods (e.g., EAGLE) rely on *N* sequential forward passes to generate *N* candidate tokens, limiting inference acceleration. This work proposes FastEagle, a non-autoregressive cascaded drafter that generates multi-layer draft tokens in parallel via a single forward pass, fully eliminating sequential dependencies during drafting. Key contributions include: (1) a lightweight hierarchical cascaded architecture; (2) layer-wise supervised training to improve draft quality and mitigate error accumulation; and (3) structural constraints on the draft tree to guarantee lossless verification. FastEagle is compatible with both greedy and stochastic decoding. Evaluated across multiple large language models and tasks, it outperforms EAGLE-3 in speedup with comparable acceptance lengths and achieves up to a 42% reduction in end-to-end latency, significantly improving the efficiency and practicality of speculative decoding.
📝 Abstract
Speculative decoding accelerates generation by drafting candidates and verifying them in parallel, yet state-of-the-art drafters (e.g., EAGLE) still require N sequential passes to propose N tokens. We present FastEagle, a non-autoregressive cascaded drafter that emits an entire draft in a single forward pass. FastEagle replaces temporal steps with a lightweight layer cascade and trains with layer-wise supervision to mitigate error accumulation. Coupled with a constrained draft tree that preserves lossless verification cost, FastEagle delivers substantial wall-clock speedups over strong autoregressive drafters while maintaining competitive acceptance behavior. Across multiple LLMs (Vicuna-13B, LLaMA-Instruct 3.x, and DeepSeek-R1-Distill-LLaMA) and tasks (MT-Bench, HumanEval, GSM8K, CNN/DM, Alpaca), FastEagle consistently outperforms EAGLE-3 in speedup under both greedy and stochastic decoding, with comparable average acceptance lengths. These results indicate that removing sequential dependencies in drafting is a practical path toward lossless LLM inference acceleration.
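To make the draft-then-verify loop concrete, below is a minimal sketch of lossless greedy speculative decoding: a drafter proposes several tokens at once (as FastEagle does in a single forward pass), and the target model accepts the longest prefix that matches its own greedy choices, then appends one corrected token. The `target_next` and `draft_tokens` functions are toy stand-ins, not the paper's models or API.

```python
# Minimal sketch of greedy speculative decoding verification.
# Not the authors' implementation: target_next and draft_tokens are
# toy deterministic stand-ins for the target LLM and the drafter.

def target_next(prefix):
    # Toy "target model": a deterministic next-token rule.
    return (sum(prefix) + 1) % 7

def draft_tokens(prefix, k=4):
    # Toy non-autoregressive drafter: proposes k tokens for one "pass".
    # It mimics draft quality degrading at deeper positions.
    out, p = [], list(prefix)
    for i in range(k):
        guess = target_next(p) if i < 2 else 0  # wrong after 2 tokens
        out.append(guess)
        p.append(guess)
    return out

def verify(prefix, draft):
    # Lossless verification: accept the longest prefix of the draft
    # matching the target's greedy choices, then add one corrected
    # token from the target, so every step yields >= 1 token.
    accepted, p = [], list(prefix)
    for t in draft:
        if t != target_next(p):
            break
        accepted.append(t)
        p.append(t)
    accepted.append(target_next(p))
    return accepted

print(verify([1, 2], draft_tokens([1, 2])))  # -> [4, 1, 2]
```

Because verification only checks draft tokens against the target model's own distribution, the accepted output is identical to what plain autoregressive decoding would produce; the speedup comes entirely from how many draft tokens survive verification per pass.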