Quasar: Quantized Self-Speculative Acceleration for Rapid Inference via Memory-Efficient Verification

📅 2026-03-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the memory bandwidth bottleneck in the verification phase of speculative decoding for large language model inference. It introduces, for the first time, a training-free, plug-and-play memory-efficient verification framework that applies low-bit quantization specifically to this phase. The proposed method significantly reduces memory traffic while preserving high verification accuracy and accepted sequence length. Experimental results demonstrate that the approach achieves a 1.28× end-to-end throughput improvement on models such as OpenPangu and Qwen3, effectively overcoming the memory wall that constrains current speculative decoding systems.

📝 Abstract
Speculative Decoding (SD) has emerged as a premier technique for accelerating Large Language Model (LLM) inference by decoupling token generation into rapid drafting and parallel verification. While recent advancements in self-speculation and lookahead decoding have successfully minimized drafting overhead, they have shifted the primary performance bottleneck to the verification phase. Since verification requires a full forward pass of the target model, it remains strictly memory-bandwidth bound, fundamentally limiting the maximum achievable speedup. In this paper, we introduce Quasar (Quantized Self-speculative Acceleration for Rapid Inference), a novel, training-free framework designed to overcome this "memory wall" by employing low-bit quantization specifically for the verification stage. Our empirical analysis reveals that while aggressive structural pruning significantly degrades verification accuracy, quantization-based verification preserves the logit distribution with high fidelity while effectively halving memory traffic. Extensive experiments on state-of-the-art models (e.g., OpenPangu and Qwen3) demonstrate that Quasar maintains a speculative acceptance length comparable to full-precision methods while achieving a 1.28× improvement in end-to-end throughput. Being orthogonal to existing drafting strategies, Quasar offers a generic and efficient pathway to accelerating the verification leg of speculative execution. Code is available at https://github.com/Tom-HG/Quasar.
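The core claim above is that verification tolerates low-bit arithmetic: acceptance only depends on the target model's logit ranking, which quantization largely preserves. The toy sketch below illustrates this with greedy verification; the `quantize` and `verify` functions and the coarse-rounding "quantizer" are illustrative assumptions, not code from the Quasar repository.

```python
def quantize(logits, scale=0.25):
    """Stand-in for low-bit quantization: snap logits to a coarse grid.
    Small perturbations like this rarely change the argmax, which is
    all greedy verification looks at."""
    return [round(x / scale) * scale for x in logits]

def verify(draft_tokens, target_logits_per_step, quantized=False):
    """Greedy speculative verification: accept the longest prefix of
    drafted tokens that matches the target model's greedy choice at
    each position; stop at the first mismatch."""
    accepted = []
    for tok, logits in zip(draft_tokens, target_logits_per_step):
        if quantized:
            logits = quantize(logits)  # verify under reduced precision
        greedy = max(range(len(logits)), key=logits.__getitem__)
        if greedy != tok:
            break  # draft diverged from the target model here
        accepted.append(tok)
    return accepted
```

Running `verify` with and without `quantized=True` on the same logits typically returns the same accepted prefix, mirroring the paper's observation that quantized verification keeps the acceptance length of full-precision verification while halving memory traffic.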
Problem

Research questions and friction points this paper is trying to address.

Speculative Decoding
LLM inference
memory bandwidth bottleneck
verification phase
quantization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Quantization
Speculative Decoding
Memory-Efficient Verification
LLM Inference Acceleration
Self-Speculation