From Quarter to All: Accelerating Speculative LLM Decoding via Floating-Point Exponent Remapping and Parameter Sharing

📅 2025-10-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address high inference latency, quantization-induced accuracy degradation, and non-negligible overhead from existing speculative decoding methods for large language models (LLMs), this paper proposes SPEQ—a training-free, zero-storage-overhead, lossless speculative decoding framework. SPEQ’s core innovations are: (1) dynamic lightweight draft model construction via floating-point exponent remapping and parameter sharing, reusing original model weights without additional parameters; and (2) a novel algorithm-hardware co-design enabling unified execution of draft generation and verification on reconfigurable computing arrays. SPEQ incurs no training cost or storage overhead while achieving both high throughput and zero accuracy loss. Extensive experiments across 15 LLMs and diverse tasks demonstrate that SPEQ accelerates inference by 2.07×, 1.53×, and 1.45× over FP16 baseline, Olive, and Tender, respectively.

📝 Abstract
Large language models achieve impressive performance across diverse tasks but exhibit high inference latency due to their large parameter sizes. While quantization reduces model size, it often degrades performance relative to the full model. Speculative decoding remains lossless but typically incurs extra overheads. We propose SPEQ, an algorithm-hardware co-designed speculative decoding method that uses part of the full-model weight bits to form a quantized draft model, thereby eliminating additional training or storage overhead. A reconfigurable processing element array enables efficient execution of both the draft and verification passes. Experimental results across 15 LLMs and tasks demonstrate that SPEQ achieves speedups of 2.07x, 1.53x, and 1.45x over FP16, Olive, and Tender, respectively.
Problem

Research questions and friction points this paper is trying to address.

Accelerating speculative LLM decoding via hardware-software co-design
Reducing inference latency without performance degradation
Eliminating extra training and storage overheads in quantization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses quantized draft model from full-model weights
Employs reconfigurable processing element array
Eliminates extra training and storage overhead
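The central idea of reusing the full model's stored weight bits as a draft model can be illustrated with a small sketch. This is not the paper's implementation (which additionally performs exponent remapping and runs on a reconfigurable PE array); it only shows, under simplified assumptions, how truncating an FP16 weight to its sign, exponent, and a few high-order mantissa bits yields a coarse "draft" weight with no extra storage, since the draft bits are a prefix of the bits already held for verification:

```python
import numpy as np

def draft_weights(w_fp16: np.ndarray, keep_mantissa_bits: int = 2) -> np.ndarray:
    """Form an illustrative low-precision 'draft' view of FP16 weights by
    keeping only the sign bit, the 5 exponent bits, and the top
    `keep_mantissa_bits` mantissa bits. The full weights stay untouched
    for the lossless verification pass."""
    bits = w_fp16.astype(np.float16).view(np.uint16)
    # FP16 layout: 1 sign bit | 5 exponent bits | 10 mantissa bits.
    drop = 10 - keep_mantissa_bits
    mask = np.uint16((0xFFFF >> drop) << drop)  # zero out low mantissa bits
    return (bits & mask).view(np.float16)

full = np.array([-1.5, 3.25, 0.007], dtype=np.float16)
draft = draft_weights(full)  # coarse approximation of `full`
```

Because the draft keeps the exponent intact, the relative error of each draft weight is bounded by 2^-keep_mantissa_bits, which is what makes the draft predictions close enough for the verification pass to accept most tokens.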
Authors:
- Yushu Zhao (BNRist, Tsinghua University)
- Yubin Qin (BNRist, Tsinghua University)
- Yang Wang (BNRist, Tsinghua University)
- Xiaolong Yang (BNRist, Tsinghua University)
- Huiming Han (BNRist, Tsinghua University)
- Shaojun Wei (Professor, Tsinghua University)
- Yang Hu (BNRist, Tsinghua University)
- Shouyi Yin (Tsinghua University)