Speculative Decoding Meets Quantization: Compatibility Evaluation and Hierarchical Framework Design

📅 2025-05-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies a performance degradation mechanism when combining 4-bit weight quantization with advanced tree-based speculative decoding (e.g., EAGLE-2): the memory savings from quantization are offset by the computational overhead and memory bandwidth pressure of verifying tree-structured drafts. To address this, we propose a hierarchical speculative framework that inserts a lightweight intermediate model to convert tree-style drafts into sequential drafts—thereby preserving the efficiency gains of low-bit matrix multiplication while respecting memory bandwidth constraints. Our approach is the first to enable compatible integration of 4-bit quantized models (AWQ/GPTQ) with efficient speculative decoding. On an A100 GPU, it achieves a 2.78× speedup for 4-bit Llama-3-70B inference—1.31× faster than EAGLE-2—while maintaining strong generalization across diverse downstream tasks.

📝 Abstract
Speculative decoding and quantization effectively accelerate memory-bound inference of large language models. Speculative decoding mitigates the memory bandwidth bottleneck by verifying multiple tokens within a single forward pass, which increases computational effort. Quantization achieves this optimization by compressing weights and activations into lower bit-widths and also reduces computations via low-bit matrix multiplications. To further leverage their strengths, we investigate the integration of these two techniques. Surprisingly, experiments applying the advanced speculative decoding method EAGLE-2 to various quantized models reveal that the memory benefits from 4-bit weight quantization are diminished by the computational load from speculative decoding. Specifically, verifying a tree-style draft incurs significantly more time overhead than a single-token forward pass on 4-bit weight quantized models. This finding led to our new speculative decoding design: a hierarchical framework that employs a small model as an intermediate stage to turn tree-style drafts into sequence drafts, leveraging the memory access benefits of the target quantized model. Experimental results show that our hierarchical approach achieves a 2.78× speedup across various tasks for the 4-bit weight Llama-3-70B model on an A100 GPU, outperforming EAGLE-2 by 1.31×. Code available at https://github.com/AI9Stars/SpecMQuant.
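The draft-then-verify loop the abstract describes can be sketched in miniature. This is an illustrative simplification, not the paper's implementation: the tree-style drafter and the intermediate model are collapsed into a single `draft_model` callable that already emits a flattened sequential draft, and acceptance uses a greedy match against the target's prediction rather than the probabilistic accept rule.

```python
def draft_tokens(prompt, k, draft_model):
    """Stages 1-2 (collapsed): produce a sequential draft of k tokens.

    In the paper's hierarchy, a tree-style drafter feeds a small
    intermediate model that flattens the tree into this one sequence,
    so the expensive quantized target only ever sees a linear draft.
    """
    seq = list(prompt)
    out = []
    for _ in range(k):
        t = draft_model(seq)  # cheap model proposes the next token
        out.append(t)
        seq.append(t)
    return out


def verify_sequential(prompt, draft, target_model):
    """Stage 3: the 4-bit target verifies the sequential draft.

    Accept the longest prefix the target agrees with; on the first
    disagreement, emit the target's own token instead and stop. If the
    whole draft is accepted, the target contributes one bonus token.
    (Greedy-match simplification of the usual acceptance criterion.)
    """
    seq = list(prompt)
    accepted = []
    for t in draft:
        pred = target_model(seq)
        if pred == t:
            accepted.append(t)   # draft token confirmed
            seq.append(t)
        else:
            accepted.append(pred)  # target's correction; discard the rest
            break
    else:
        accepted.append(target_model(seq))  # bonus token, all accepted
    return accepted
```

Because the target verifies a single sequence instead of a wide token tree, each verification pass stays close to the cost of one ordinary forward pass on the quantized model, which is the memory-bandwidth argument the abstract makes.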
Problem

Research questions and friction points this paper is trying to address.

Evaluate compatibility between speculative decoding and quantization techniques
Address diminished memory benefits from 4-bit quantization in speculative decoding
Design hierarchical framework to optimize memory access and computational efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates speculative decoding with quantization techniques
Hierarchical framework uses small intermediate model
Optimizes 4-bit quantized models for faster inference
Yudi Zhang
Faculty of Computing, Harbin Institute of Technology, Harbin, China.
Weilin Zhao
Tsinghua University
Xu Han
Tsinghua University, Beijing, China.
Tiejun Zhao
Faculty of Computing, Harbin Institute of Technology, Harbin, China.
Wang Xu
Harbin Institute of Technology
Hailong Cao
Harbin Institute of Technology
Conghui Zhu
Faculty of Computing, Harbin Institute of Technology, Harbin, China.