🤖 AI Summary
This work addresses the critical challenges in ANN-to-SNN conversion—namely high latency and the trade-off between energy efficiency and accuracy—where high-bit quantization demands long time windows, single-spike encoding incurs significant information loss, and multi-spike schemes suffer from elevated energy consumption. To overcome these limitations, the authors propose Kirin, a hybrid integer-and-spiking neural network architecture that introduces a novel spike-matrix hybrid strategy: low-bit parameters are converted into spikes while high-bit ones remain as integers, complemented by a silent-threshold mechanism. This approach mathematically guarantees output equivalence to the original ANN, enabling lossless accuracy conversion. Under W4A4/8 mixed quantization, Kirin achieves near-FP16 accuracy while reducing energy consumption by 84.66% and decreasing the number of time steps by 93.75%.
📝 Abstract
Artificial neural networks (ANNs), particularly large language models (LLMs), demonstrate powerful inference capabilities but consume substantial energy. Conversely, spiking neural networks (SNNs) exhibit exceptional energy efficiency due to their binary and event-driven characteristics, motivating the study of ANN-to-SNN conversion. In this process, quantization plays a pivotal role, mapping LLMs' floating-point parameters to discrete SNN parameters via the temporal dimension of the time window. However, several challenges remain: (i) converting high bit-width quantized values into binary spikes requires longer time windows, increasing system latency; and (ii) SNNs face an inherent trade-off between the information loss of single-spike schemes and the energy cost of multi-spike ones. To address these challenges, we propose Kirin, an integer-and-spike hybrid SNN that achieves accuracy-lossless ANN-to-SNN conversion with time and energy efficiency. Specifically, we first propose a Spike Matrix Hybridization strategy that encodes low bit-width parameters, which require only small time windows, into binary spikes while preserving the rest in integer format, thereby reducing the overall latency of SNN execution. Second, we introduce a silent threshold mechanism to regulate the timing of single-spike firing, ensuring that the output is mathematically equivalent to the LLM's output and accuracy is preserved. Experimental results demonstrate that Kirin, under a W4A4&8 quantization setting, achieves near-FP16 accuracy while reducing energy consumption by up to 84.66% and shortening time steps by 93.75%.
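The hybrid idea in the abstract can be illustrated with a minimal sketch: low bit-width quantized values are unrolled into binary spike trains over a short time window, while high bit-width values stay as integers, and summing each spike train recovers the original value exactly (losslessly). This is only an assumption of a simple rate-style encoding for illustration; the paper's actual Spike Matrix Hybridization and silent-threshold mechanics are not specified here, and the function names (`hybrid_encode`, `decode`) are hypothetical.

```python
import numpy as np

def hybrid_encode(values, bit_widths, low_bit_threshold=4):
    """Split quantized values: values at or below the bit-width threshold
    become binary spike trains (one spike per unit over a 2**b - 1 step
    window); wider values are kept as plain integers. Illustrative only."""
    spikes, integers = {}, {}
    for i, (v, b) in enumerate(zip(values, bit_widths)):
        if b <= low_bit_threshold:
            T = 2**b - 1                    # window long enough for any b-bit value
            train = np.zeros(T, dtype=np.int8)
            train[:v] = 1                   # v spikes within T time steps
            spikes[i] = train
        else:
            integers[i] = v                 # high-bit value stays an integer
    return spikes, integers

def decode(spikes, integers, n):
    """Summing each spike train recovers the encoded value exactly,
    so the hybrid representation is lossless."""
    out = np.zeros(n, dtype=np.int64)
    for i, train in spikes.items():
        out[i] = int(train.sum())
    for i, v in integers.items():
        out[i] = v
    return out

vals = [3, 200, 15, 7]   # quantized magnitudes (made-up example data)
bits = [4, 8, 4, 4]      # per-value bit widths, mimicking a W4A4&8-style mix
s, ints = hybrid_encode(vals, bits)
assert list(decode(s, ints, len(vals))) == vals   # lossless round trip
```

Note how only the 4-bit values pay the temporal cost (a 15-step window), while the 8-bit value avoids the 255-step window it would need as spikes, which is the latency saving the abstract describes.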