Fail Fast, Win Big: Rethinking the Drafting Strategy in Speculative Decoding via Diffusion LLMs

📅 2025-12-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the trade-off between efficiency and generation quality when using diffusion-based large language models (dLLMs) for speculative decoding, this paper proposes FailFast: a framework that leverages dLLMs as drafters and turns their "fast-but-error-prone" characteristic into an advantage. FailFast introduces a dynamic length-adaptive mechanism that terminates drafting early in hard-to-speculate regions to minimize latency, while substantially extending draft lengths—up to 70 tokens—in easy regions, achieving a "fail-fast, win-big" balance. Crucially, it requires no fine-tuning and integrates zero-shot with autoregressive (AR) verifiers, preserving generation quality without degradation. Experiments demonstrate up to a 4.9× end-to-end speedup over standard AR decoding, a 1.7× speedup over the best fixed-length dLLM baseline, and a 1.4× speedup over EAGLE-3. The implementation is publicly available.

📝 Abstract
Diffusion Large Language Models (dLLMs) offer fast, parallel token generation, but their standalone use is plagued by an inherent efficiency-quality tradeoff. We show that, if carefully applied, the attributes of dLLMs can actually be a strength for drafters in speculative decoding with autoregressive (AR) verifiers. Our core insight is that the dLLMs' speed from parallel decoding drastically lowers the risk of costly rejections, providing a practical mechanism to effectively realize the (elusive) lengthy drafts that lead to large speedups with speculative decoding. We present FailFast, a dLLM-based speculative decoding framework that realizes this approach by dynamically adapting its speculation length. It "fails fast" by spending minimal compute in hard-to-speculate regions to shrink speculation latency and "wins big" by aggressively extending draft lengths in easier regions to reduce verification latency (in many cases, speculating and accepting 70 tokens at a time!). Without any fine-tuning, FailFast delivers lossless acceleration of AR LLMs and achieves up to 4.9$\times$ speedup over vanilla decoding, 1.7$\times$ over the best naive dLLM drafter, and 1.4$\times$ over EAGLE-3 across diverse models and workloads. We open-source FailFast at https://github.com/ruipeterpan/failfast.
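The draft-then-verify loop with an adaptive speculation length can be sketched as below. This is a minimal illustrative simulation, not the paper's implementation: `draft_fn`, `verify_fn`, and the double/halve length-adaptation rule are hypothetical stand-ins for the dLLM drafter, the AR verifier, and FailFast's actual length controller; the limits `k_min`/`k_max` are placeholder values (with `k_max=70` echoing the draft lengths reported in the abstract).

```python
def speculative_step(draft_fn, verify_fn, prefix, k):
    """Draft k tokens, verify them one by one against the target model.

    Returns (accepted_tokens, all_accepted). On the first mismatch, the
    verifier's own token is emitted instead, so output is always lossless.
    """
    draft = draft_fn(prefix, k)
    accepted = []
    for tok in draft:
        target = verify_fn(prefix + accepted)
        if tok == target:
            accepted.append(tok)
        else:
            accepted.append(target)  # verifier's correction: one "free" token
            return accepted, False
    return accepted, True

def adaptive_spec_decode(draft_fn, verify_fn, prompt, n_tokens,
                         k_init=8, k_min=2, k_max=70):
    """Generate n_tokens, adapting speculation length k between steps.

    "Win big": double k after a fully accepted draft (cheap verification of
    long spans). "Fail fast": halve k after a rejection (spend little compute
    in hard-to-speculate regions). The adaptation rule here is a toy heuristic.
    """
    out, k = list(prompt), k_init
    while len(out) - len(prompt) < n_tokens:
        accepted, all_ok = speculative_step(draft_fn, verify_fn, out, k)
        out += accepted
        k = min(k * 2, k_max) if all_ok else max(k // 2, k_min)
    return out[len(prompt):][:n_tokens]
```

Because every emitted token is either confirmed or supplied by the verifier, the output matches what the verifier would have produced alone — the lossless-acceleration property the abstract claims.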
Problem

Research questions and friction points this paper is trying to address.

How to use fast but error-prone diffusion LLMs as effective drafters in speculative decoding
The inherent trade-off between generation speed and quality in parallel token generation
How to set speculation length so that long drafts pay off without incurring costly rejections
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses diffusion LLMs for parallel token drafting
Dynamically adapts draft length based on difficulty
Achieves lossless acceleration without model fine-tuning