From Quarter to All: Accelerating Speculative LLM Decoding via Floating-Point Exponent Remapping and Parameter Sharing

📅 2025-10-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address high inference latency, quantization-induced accuracy degradation, and non-negligible overhead from existing speculative decoding methods for large language models (LLMs), this paper proposes SPEQ—a training-free, zero-storage-overhead, lossless speculative decoding framework. SPEQ’s core innovations are: (1) dynamic lightweight draft model construction via floating-point exponent remapping and parameter sharing, reusing original model weights without additional parameters; and (2) a novel algorithm-hardware co-design enabling unified execution of draft generation and verification on reconfigurable computing arrays. SPEQ incurs no training cost or storage overhead while achieving both high throughput and zero accuracy loss. Extensive experiments across 15 LLMs and diverse tasks demonstrate that SPEQ accelerates inference by 2.07×, 1.53×, and 1.45× over FP16 baseline, Olive, and Tender, respectively.

📝 Abstract
Large language models achieve impressive performance across diverse tasks but exhibit high inference latency due to their large parameter sizes. While quantization reduces model size, it often degrades performance relative to the full model. Speculative decoding remains lossless but typically incurs extra overheads. We propose SPEQ, an algorithm-hardware co-designed speculative decoding method that uses part of the full-model weight bits to form a quantized draft model, thereby eliminating additional training or storage overhead. A reconfigurable processing element array enables efficient execution of both the draft and verification passes. Experimental results across 15 LLMs and tasks demonstrate that SPEQ achieves speedups of 2.07x, 1.53x, and 1.45x over FP16, Olive, and Tender, respectively.
Problem

Research questions and friction points this paper is trying to address.

Accelerating speculative LLM decoding via hardware-software co-design
Reducing inference latency without performance degradation
Eliminating extra training and storage overheads in quantization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses quantized draft model from full-model weights
Employs reconfigurable processing element array
Eliminates extra training and storage overhead
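The central idea of reusing the full model's stored weight bits as a draft model can be illustrated with a small sketch. This is not the paper's implementation (which additionally performs exponent remapping and runs on a reconfigurable PE array); it only shows, under simplified assumptions, how truncating an FP16 weight to its sign, exponent, and a few high-order mantissa bits yields a coarse "draft" weight with no extra storage, since the draft bits are a prefix of the bits already held for verification:

```python
import numpy as np

def draft_weights(w_fp16: np.ndarray, keep_mantissa_bits: int = 2) -> np.ndarray:
    """Form an illustrative low-precision 'draft' view of FP16 weights by
    keeping only the sign bit, the 5 exponent bits, and the top
    `keep_mantissa_bits` mantissa bits. The full weights stay untouched
    for the lossless verification pass."""
    bits = w_fp16.astype(np.float16).view(np.uint16)
    # FP16 layout: 1 sign bit | 5 exponent bits | 10 mantissa bits.
    drop = 10 - keep_mantissa_bits
    mask = np.uint16((0xFFFF >> drop) << drop)  # zero out low mantissa bits
    return (bits & mask).view(np.float16)

full = np.array([-1.5, 3.25, 0.007], dtype=np.float16)
draft = draft_weights(full)  # coarse approximation of `full`
```

Because the draft keeps the exponent intact, the relative error of each draft weight is bounded by 2^-keep_mantissa_bits, which is what makes the draft predictions close enough for the verification pass to accept most tokens.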
Authors:
- Yushu Zhao (BNRist, Tsinghua University)
- Yubin Qin (BNRist, Tsinghua University)
- Yang Wang (BNRist, Tsinghua University)
- Xiaolong Yang (BNRist, Tsinghua University)
- Huiming Han (BNRist, Tsinghua University)
- Shaojun Wei (Professor, Tsinghua University)
- Yang Hu (BNRist, Tsinghua University)
- Shouyi Yin (Tsinghua University)