Energy-Efficient and Dequantization-Free Q-LLMs: A Spiking Neural Network Approach to Salient Value Mitigation

📅 2025-10-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address three critical challenges in edge deployment of large language models (LLMs)—high energy consumption under quantization, substantial dequantization overhead, and severe clipping of salient values—this paper proposes SpikeQuant, the first method to couple the integrate-and-fire dynamics of spiking neural networks (SNNs) with quantization scaling. Its key contributions are: (1) implicit embedding of quantization scaling via binary spike coding, eliminating explicit dequantization; (2) channel-wise dynamic mixed-precision quantization driven by spike firing rates, mitigating severe truncation of salient values caused by uniform low-bit quantization; and (3) replacement of conventional multiply-accumulate (MAC) operations with threshold-modulated linear integration. Under a W4A4 configuration, SpikeQuant achieves language modeling perplexity comparable to FP16, while delivering up to 4.6× higher energy efficiency than state-of-the-art quantization methods—significantly improving the accuracy–efficiency trade-off.
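The summary's core trick—replacing MACs with threshold-modulated integration—can be illustrated with a minimal sketch. This is an illustrative toy, not the paper's implementation: a non-negative quantized activation `x_q[i]` is re-encoded as `x_q[i]` unary spikes over time steps, and each spike simply adds the corresponding weight row, so the result equals `x_q @ w_q` using additions only.

```python
import numpy as np

def spike_acc_linear(x_q, w_q, timesteps):
    """Multiplication-free linear layer via IF-style re-encoding: each
    quantized activation x_q[i] (a non-negative integer spike count)
    fires once per time step while its count exceeds t, and every
    spike accumulates weight row w_q[i]."""
    acc = np.zeros(w_q.shape[1], dtype=np.int64)
    for t in range(timesteps):
        for i, count in enumerate(x_q):
            if count > t:          # neuron i emits a spike at step t
                acc += w_q[i]      # accumulate only -- no multiply
    return acc

rng = np.random.default_rng(0)
x_q = rng.integers(0, 16, size=8)        # 4-bit unsigned activations
w_q = rng.integers(-8, 8, size=(8, 4))   # 4-bit signed weights
assert np.array_equal(spike_acc_linear(x_q, w_q, 16), x_q @ w_q)
```

With `timesteps` at least the maximum activation value (16 for 4-bit), the accumulated sum is exactly the MAC result.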

📝 Abstract
In the era of large language models (LLMs), weight-activation quantization helps fit models on edge devices by reducing memory and compute bit-widths. However, three challenges persist for energy-constrained hardware: (1) even after quantization, multiply-accumulate (MAC) operations remain unavoidable and continue to dominate energy consumption; (2) dequantization (or per-tensor/per-channel rescaling) introduces extra arithmetic and data movement, increasing latency and energy; (3) uniform low bit-widths clip salient values, while intra-channel mixed precision is generally impractical on current matrix hardware and memory. In contrast, brain-inspired Spiking Neural Networks (SNNs), owing to their binary spike-based information representation and the Integrate-and-Fire (IF) paradigm, naturally support mixed-precision storage and energy-efficient computation by replacing complex MACs with temporal accumulates (ACCs). Motivated by this property, we propose SpikeQuant, which selectively applies mixed-precision quantization to activations with salient values and re-encodes them into binary spike counts, thereby enabling dynamic mixed storage of different bit-widths. Furthermore, by embedding the quantization scale into the threshold of the IF mechanism, our approach performs energy-efficient linear transformations on weights and activations while avoiding explicit dequantization. Experimental results demonstrate that SpikeQuant consistently achieves near-FP16 perplexity under W4A4 quantization while reducing energy cost by up to 4.6 times compared to existing methods, highlighting its effectiveness for accurate and energy-efficient LLM deployment.
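The abstract's second idea—folding the quantization scale into the IF firing threshold—can be sketched as follows. This is a hedged toy, not the paper's mechanism: with threshold θ = 1/scale, integrating the raw integer potential yields the same spike count that an explicit dequantize-then-encode pipeline would produce, so the per-element rescaling multiply disappears.

```python
def dequant_then_fire(v_int, scale, theta=1.0):
    """Baseline: explicit dequantization, then IF encoding against a
    unit threshold -- one multiply per element."""
    v = v_int * scale              # the rescale we want to eliminate
    return int(v // theta)         # spikes emitted while potential >= theta

def threshold_embedded_fire(v_int, scale):
    """Fold the scale into the threshold instead: theta = 1/scale, so
    the raw integer potential produces the identical spike count."""
    theta = 1.0 / scale
    return int(v_int // theta)

# Both encodings agree for every 5-bit potential at scale 0.25.
for v in range(32):
    assert dequant_then_fire(v, 0.25) == threshold_embedded_fire(v, 0.25)
```

In hardware terms, the comparison against a modulated threshold is a subtract-and-compare loop, which is what makes the arithmetic accumulation-only.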
Problem

Research questions and friction points this paper is trying to address.

Eliminates MAC operations to reduce energy consumption in LLMs
Avoids dequantization overhead to decrease latency and energy
Enables mixed-precision storage while preserving salient values
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spiking Neural Networks replace MACs with ACCs
Mixed-precision quantization for salient activations
Embed quantization scale into IF threshold mechanism
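The three innovation bullets can be tied together with a toy channel-selection sketch. Assumptions flagged: the firing-rate proxy (mean absolute activation) and the 10% salient fraction are illustrative choices, not the paper's calibration procedure.

```python
import numpy as np

def assign_channel_bits(acts, high_bits=8, low_bits=4, salient_frac=0.1):
    """Channel-wise dynamic mixed precision: channels whose proxy
    firing rate (mean |activation|) is largest keep a higher bit-width,
    so salient values are not clipped by a uniform low-bit grid."""
    rate = np.abs(acts).mean(axis=0)               # per-channel proxy firing rate
    k = max(1, int(salient_frac * acts.shape[1]))  # number of salient channels
    bits = np.full(acts.shape[1], low_bits)
    bits[np.argsort(rate)[-k:]] = high_bits        # top-k channels get more bits
    return bits

acts = np.random.default_rng(1).normal(size=(64, 20))
acts[:, 3] *= 50.0                                 # make channel 3 an outlier
bits = assign_channel_bits(acts)
assert bits[3] == 8 and (bits == 4).sum() == 18
```

Only the few salient channels pay for the wider representation; the rest stay at the low bit-width, preserving the W4A4 storage budget on average.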
Chenyu Wang
Sun Yat-sen University, Guangzhou, Guangdong, China
Zhanglu Yan
National University of Singapore
Artificial Intelligence
Zhi Zhou
Sun Yat-sen University, Guangzhou, Guangdong, China
Xu Chen
Sun Yat-sen University, Guangzhou, Guangdong, China
Weng-Fai Wong
Associate Professor of Computer Science, National University of Singapore
Computer architecture, compilers, high performance computing, embedded systems, parallel processing