QSpec: Speculative Decoding with Complementary Quantization Schemes

📅 2024-10-15
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the sharp degradation in multi-step inference performance of large language models (LLMs) under low-bit quantization on edge devices, this paper proposes QSpec, a dual-mode complementary quantization paradigm tailored for speculative decoding. Its key contribution is the first zero-overhead dynamic coordination between joint activation-weight quantization (in the draft stage) and weight-only quantization (in the verify stage), requiring no fine-tuning, imposing no extra memory overhead, and enabling plug-and-play deployment. By reusing KV caches and adaptively switching execution paths, QSpec simultaneously achieves high throughput and a high acceptance rate without compromising generation quality. Experiments demonstrate that, compared to FP16 baselines, QSpec achieves up to 1.64x higher token throughput; against state-of-the-art speculative decoding methods, it delivers a 1.55x batched inference speedup with superior acceptance rates. The gains are consistent across diverse model scales, quantization strategies, and batch sizes.

📝 Abstract
Quantization has been substantially adopted to accelerate inference and reduce memory consumption of large language models (LLMs). While activation-weight joint quantization speeds up the inference process through low-precision kernels, we demonstrate that it suffers severe performance degradation on multi-step reasoning tasks, rendering it ineffective. We propose a novel quantization paradigm called QSPEC, which seamlessly integrates two complementary quantization schemes for speculative decoding. Leveraging nearly cost-free execution switching, QSPEC drafts tokens with low-precision, fast activation-weight quantization, and verifies them with high-precision weight-only quantization, effectively combining the strengths of both quantization schemes. Compared to high-precision quantization methods, QSPEC empirically boosts token generation throughput by up to 1.64x without any quality compromise, distinguishing it from other low-precision quantization approaches. This enhancement is also consistent across various serving tasks, model sizes, quantization methods, and batch sizes. Compared to state-of-the-art speculative decoding methods, our approach reuses weights and the KV cache, avoiding extra memory overhead while achieving up to 1.55x speedup in batched serving with a high acceptance rate. Furthermore, QSPEC offers a plug-and-play advantage without requiring any training. We believe that QSPEC demonstrates unique strengths for future deployment of high-fidelity quantization schemes, particularly in memory-constrained scenarios (e.g., edge devices).
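The draft-with-low-precision, verify-with-high-precision loop described in the abstract can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation: `draft_step`, `verify_step`, the `gamma` draft length, and the greedy accept-or-correct rule are all assumptions standing in for the paper's actual kernels and probabilistic acceptance test.

```python
# Hypothetical sketch of a QSpec-style draft/verify loop (illustrative names).
# The same model weights back two execution paths:
#   draft_step  -> fast, low-precision activation-weight kernel (e.g. W4A4)
#   verify_step -> high-fidelity weight-only kernel (e.g. W4A16)
# Because both paths share one set of weights and one KV cache, switching
# between them adds no memory overhead.

def qspec_generate(prompt, draft_step, verify_step, num_tokens, gamma=4):
    """Generate num_tokens tokens, drafting gamma candidates per round."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < num_tokens:
        # 1) Draft gamma candidate tokens with the fast low-precision path.
        draft = []
        for _ in range(gamma):
            draft.append(draft_step(tokens + draft))

        # 2) Verify the candidates with the high-precision path; accept
        #    matches, and on the first mismatch emit the corrected token
        #    and discard the rest of the draft.
        accepted = []
        for i, tok in enumerate(draft):
            target = verify_step(tokens + draft[:i])
            if target == tok:
                accepted.append(tok)
            else:
                accepted.append(target)
                break
        tokens.extend(accepted)
    return tokens[len(prompt):][:num_tokens]
```

Note the worst case: even if every draft token is rejected, each round still emits one verified token, so output quality tracks the high-precision path while throughput improves with the acceptance rate. Real speculative decoders compare token distributions rather than exact token identity; exact matching is used here only to keep the sketch self-contained.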
Problem

Research questions and friction points this paper is trying to address.

Quantization
Large Language Models
Accuracy-Speed Tradeoff
Innovation

Methods, ideas, or system contributions that make the work stand out.

QSpec Quantization
Speculative Decoding
Memory-Efficient Large Language Models