ML-SpecQD: Multi-Level Speculative Decoding with Quantized Drafts

📅 2025-03-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Existing speculative decoding methods for large language model (LLM) inference acceleration require dedicated training of draft models, hindering plug-and-play deployment. Method: This paper proposes a fine-tuning-free draft model based on MXFP4 weight quantization and introduces a multi-level speculative decoding (ML-SD) architecture. Contribution/Results: It is the first work to integrate MXFP4 zero-shot quantization with recursive speculative decoding, achieving plug-and-play acceleration with zero training overhead and no model adaptation. By employing BF16/MXFP4 mixed-precision inference and a draft-target co-verification mechanism, it preserves the accuracy of the 16-bit model. Experiments show end-to-end speedups of up to 2.72× over BF16 baseline inference, surpassing prior speculative decoding approaches, and the method remains compatible with any BF16 LLM deployment.

📝 Abstract

Speculative decoding (SD) has emerged as a method to accelerate LLM inference without sacrificing any accuracy over the 16-bit model inference. In a typical SD setup, the idea is to use a full-precision, small, fast model as "draft" to generate the next few tokens and use the "target" large model to verify the draft-generated tokens. The efficacy of this method heavily relies on the acceptance ratio of the draft-generated tokens and the relative token throughput of the draft versus the target model. Nevertheless, an efficient SD pipeline requires pre-training and aligning the draft model to the target model, making it impractical for LLM inference in a plug-and-play fashion. In this work, we propose using MXFP4 models as drafts in a plug-and-play fashion since the MXFP4 Weight-Only-Quantization (WOQ) merely direct-casts the BF16 target model weights to MXFP4. In practice, our plug-and-play solution gives speedups up to 2x over the BF16 baseline. Then we pursue an opportunity for further acceleration: the MXFP4 draft token generation itself can be accelerated via speculative decoding by using yet another smaller draft. We call our method ML-SpecQD: Multi-Level Speculative Decoding with Quantized Drafts since it recursively applies speculation for accelerating the draft-token generation. Combining Multi-Level Speculative Decoding with MXFP4 Quantized Drafts we outperform state-of-the-art speculative decoding, yielding speedups up to 2.72x over the BF16 baseline.
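
The draft-then-verify loop described in the abstract can be illustrated with a minimal Python sketch. It assumes greedy decoding with exact-match acceptance, which may differ from the acceptance rule the paper uses. The `NextToken` interface, the toy models, and all function names are hypothetical and exist only for illustration. Under these assumptions the output always equals what the target alone would produce; only the number of target calls per emitted token changes.

```python
from typing import Callable, List

# Hypothetical interface: a "model" maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     prefix: List[int], k: int) -> List[int]:
    """Run one draft-then-verify round and return the tokens it emits."""
    # Draft phase: the cheap model proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # Verify phase: a real system would score all k positions in one batched
    # target forward pass; this sketch uses a plain loop for clarity.
    emitted: List[int] = []
    for tok in proposed:
        expected = target(prefix + emitted)
        if tok != expected:
            emitted.append(expected)  # first mismatch: keep the target's token
            return emitted
        emitted.append(tok)
    # Every proposal was accepted, so the target adds one extra token.
    emitted.append(target(prefix + emitted))
    return emitted


def generate(target: NextToken, draft: NextToken,
             prompt: List[int], n_new: int, k: int = 4) -> List[int]:
    """Generate n_new tokens after prompt using repeated speculative rounds."""
    out = list(prompt)
    while len(out) - len(prompt) < n_new:
        out.extend(speculative_step(target, draft, out, k))
    return out[:len(prompt) + n_new]


# Toy stand-ins for real models (assumptions, not from the paper).
def toy_target(p: List[int]) -> int:
    return (sum(p) * 31 + len(p)) % 50


def toy_draft(p: List[int]) -> int:
    # Agrees with the target except when the prefix length is a multiple of 5.
    return toy_target(p) if len(p) % 5 else (toy_target(p) + 1) % 50
```

Calling `generate(toy_target, toy_draft, [1, 2, 3], 25)` returns the same sequence as decoding with `toy_target` alone. The more often the draft agrees with the target, the more tokens each round emits, which is why the acceptance ratio matters for the speedup.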

Problem

Research questions and friction points this paper is trying to address.

Accelerate LLM inference without accuracy loss

Enable plug-and-play speculative decoding with MXFP4 models

Achieve multi-level speculative decoding for further speedup
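
As a rough illustration of why an MXFP4 draft needs no training, the sketch below simulates a direct cast of a weight list to MXFP4 in pure Python. The block layout is an assumption taken from the OCP Microscaling format rather than from this page: 32 elements share one power-of-two scale, and each element is an FP4 (E2M1) value. The function names are hypothetical, rounding is simple nearest-value (tie handling is not modeled), and the limited range of the 8-bit shared scale is ignored.

```python
import math
from typing import List

# Magnitudes representable by an FP4 E2M1 element (sign is stored separately).
FP4_E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 32      # elements sharing one scale (assumed MX block size)
E2M1_EMAX = 2   # exponent of the largest E2M1 value: 6.0 = 1.5 * 2**2


def quantize_block(block: List[float]) -> List[float]:
    """Quantize one block and return the dequantized (simulated) values."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * len(block)
    # Shared power-of-two scale chosen from the block's largest magnitude.
    scale = 2.0 ** (math.floor(math.log2(amax)) - E2M1_EMAX)
    out = []
    for x in block:
        # Snap the scaled magnitude to the nearest grid value; anything
        # above 6.0 saturates to 6.0 because it is the closest entry.
        mag = min(FP4_E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(math.copysign(mag * scale, x))
    return out


def direct_cast_mxfp4(weights: List[float]) -> List[float]:
    """Apply the block quantizer to consecutive blocks of a flat weight list."""
    out: List[float] = []
    for i in range(0, len(weights), BLOCK):
        out.extend(quantize_block(weights[i:i + BLOCK]))
    return out
```

For example, `direct_cast_mxfp4([1.0, -0.5, 0.3, 6.0, 3.1])` yields `[1.0, -0.5, 0.5, 6.0, 3.0]`. No calibration data or gradient step is involved, which is the property that makes the draft plug-and-play.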

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses MXFP4 models for plug-and-play draft generation.

Applies multi-level speculative decoding recursively.

Achieves up to 2.72x speedup over BF16 baseline.
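
The recursive idea in the list above can be sketched as follows, again assuming greedy decoding with exact-match acceptance rather than the paper's actual acceptance rule. `models[0]` plays the role of the BF16 target, `models[1]` the MXFP4 draft, and `models[2]` a smaller draft. The `NextToken` interface, the toy models, and the function name `ml_generate` are hypothetical.

```python
from typing import Callable, List

# Hypothetical interface: a "model" maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def ml_generate(models: List[NextToken], prefix: List[int],
                n_new: int, k: int = 4) -> List[int]:
    """Return n_new greedy tokens of models[0], speculated via models[1:]."""
    verifier, drafts = models[0], models[1:]
    out: List[int] = []
    if not drafts:
        # Last level: plain autoregressive decoding.
        for _ in range(n_new):
            out.append(verifier(prefix + out))
        return out

    while len(out) < n_new:
        # The k draft tokens are themselves produced by speculative decoding
        # one level down, which is the recursive step.
        proposed = ml_generate(drafts, prefix + out, k, k)
        accepted_all = True
        for tok in proposed:
            expected = verifier(prefix + out)
            out.append(expected)  # always emit the verifier's own token
            if tok != expected:
                accepted_all = False
                break
        if accepted_all:
            out.append(verifier(prefix + out))  # extra token after full acceptance
    return out[:n_new]


# Toy stand-ins for the three levels (assumptions, not from the paper).
def toy_target(p: List[int]) -> int:
    return (sum(p) * 31 + len(p)) % 50


def toy_mid(p: List[int]) -> int:
    return toy_target(p) if len(p) % 5 else (toy_target(p) + 1) % 50


def toy_small(p: List[int]) -> int:
    return toy_target(p) if len(p) % 3 else (toy_target(p) + 2) % 50
```

`ml_generate([toy_target, toy_mid, toy_small], [1, 2, 3], 20)` returns the same tokens as decoding with `toy_target` alone, since every emitted token is the verifier's own choice at each level.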

🔎 Similar Papers

Cascade Speculative Drafting for Even Faster LLM Inference