🤖 AI Summary
To address high inference latency, quantization-induced accuracy degradation, and the non-negligible overhead of existing speculative decoding methods for large language models (LLMs), this paper proposes SPEQ—a training-free, zero-storage-overhead, lossless speculative decoding framework. SPEQ's core innovations are: (1) dynamic construction of a lightweight draft model via floating-point exponent remapping and parameter sharing, which reuses the original model's weight bits without adding parameters; and (2) an algorithm-hardware co-design that executes both draft generation and verification on a reconfigurable computing array. SPEQ thus incurs no training cost or storage overhead while delivering high throughput with zero accuracy loss. Extensive experiments across 15 LLMs and diverse tasks demonstrate that SPEQ accelerates inference by 2.07×, 1.53×, and 1.45× over the FP16 baseline, Olive, and Tender, respectively.
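The draft-construction idea can be illustrated in miniature: derive a low-precision draft weight tensor by keeping only the top bits (sign, exponent, and a few mantissa bits) of each FP16 weight, so the draft shares storage with the full model. Note this is a hedged sketch of the bit-reuse concept, not SPEQ's actual exponent-remapping scheme; the function name and the 2-bit mantissa choice are illustrative assumptions.

```python
import numpy as np

def draft_weights_from_fp16(w: np.ndarray, keep_mantissa_bits: int = 2) -> np.ndarray:
    """Illustrative sketch (not SPEQ's exact scheme): form low-precision
    'draft' weights by keeping only the top bits of each FP16 weight.

    FP16 bit layout: 1 sign bit | 5 exponent bits | 10 mantissa bits.
    Zeroing the low mantissa bits yields a coarser weight that is a
    strict bit-prefix of the original, so no extra storage is needed.
    """
    assert w.dtype == np.float16
    bits = w.view(np.uint16)                      # reinterpret raw bits
    drop = 10 - keep_mantissa_bits                # low mantissa bits to clear
    mask = np.uint16((0xFFFF << drop) & 0xFFFF)   # keep sign+exp+top mantissa
    return (bits & mask).view(np.float16)
```

Because the draft weights are a bit-prefix of the full weights, applying the truncation twice changes nothing, and each draft weight is never larger in magnitude than its full-precision counterpart.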
📝 Abstract
Large language models achieve impressive performance across diverse tasks but exhibit high inference latency due to their large parameter counts. While quantization reduces model size, it often degrades accuracy relative to the full-precision model. Speculative decoding is lossless but typically incurs extra training or storage overhead. We propose SPEQ, an algorithm-hardware co-designed speculative decoding method that reuses part of the full-model weight bits to form a quantized draft model, thereby eliminating additional training and storage overhead. A reconfigurable processing-element array efficiently executes both the draft and verification passes. Experimental results across 15 LLMs and diverse tasks demonstrate that SPEQ achieves speedups of 2.07×, 1.53×, and 1.45× over FP16, Olive, and Tender, respectively.
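For readers unfamiliar with why speculative decoding is lossless, the accept/verify loop can be sketched with toy greedy models: the cheap draft proposes k tokens, the target model checks them (in hardware, one batched pass), and only the prefix the target agrees with is kept, plus one corrected token. The function names and toy next-token models below are illustrative assumptions, not SPEQ's implementation.

```python
from typing import Callable, List

def speculative_decode_greedy(
    draft_next: Callable[[List[int]], int],
    target_next: Callable[[List[int]], int],
    prompt: List[int],
    num_tokens: int,
    k: int = 4,
) -> List[int]:
    """Toy greedy speculative decoding: the draft proposes k tokens, the
    target verifies them, and the longest matching prefix is accepted
    (plus one corrected target token on the first mismatch). The output
    is identical to pure target-model greedy decoding, i.e. lossless."""
    seq = list(prompt)
    while len(seq) - len(prompt) < num_tokens:
        # Draft pass: propose k tokens autoregressively with the cheap model.
        proposal, ctx = [], list(seq)
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # Verification pass: in hardware this is a single batched target pass.
        accepted = 0
        for i, t in enumerate(proposal):
            if target_next(seq + proposal[:i]) == t:
                accepted += 1
            else:
                break
        seq.extend(proposal[:accepted])
        if accepted < k:
            seq.append(target_next(seq))  # target's corrected token
    return seq[: len(prompt) + num_tokens]
```

Every emitted token is either verified to match the target's greedy choice or produced by the target itself, which is why accuracy is preserved regardless of draft quality; the draft only affects speed via its acceptance rate.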