🤖 AI Summary
To address the challenges of inferior performance, high training overhead, and lack of architectural specialization in Spiking Neural Networks (SNNs) for large-scale speech recognition, this paper proposes IML-Spikeformer, a hybrid SNN-Transformer architecture integrating an Input-aware Multi-Level spiking mechanism and Reparameterized Spiking Self-Attention (RepSSA). Key innovations include: (1) an input-aware adaptive threshold mechanism enabling multi-timestep spike simulation within a single timestep; and (2) the HD-RepSSA module, which augments RepSSA with Hierarchical Decay Masking (HDM) to improve multi-scale temporal modeling accuracy while substantially reducing energy consumption. Evaluated on AIShell-1 and LibriSpeech-960, IML-Spikeformer achieves word error rates of 6.0% and 3.4%, respectively, on par with state-of-the-art ANN-based Transformers, while delivering theoretical inference energy reductions of 4.64× and 4.32×. This work marks the first demonstration of jointly achieving high accuracy and high energy efficiency in large-scale speech SNNs.
📝 Abstract
Spiking Neural Networks (SNNs), inspired by biological neural mechanisms, represent a promising neuromorphic computing paradigm that offers energy-efficient alternatives to traditional Artificial Neural Networks (ANNs). Despite their proven effectiveness, SNN architectures have struggled to achieve competitive performance on large-scale speech processing tasks. Two key challenges hinder progress: (1) the high computational overhead during training caused by multi-timestep spike firing, and (2) the absence of large-scale SNN architectures tailored to speech processing tasks. To overcome these issues, we introduce the Input-aware Multi-Level Spikeformer, i.e., IML-Spikeformer, a spiking Transformer architecture specifically designed for large-scale speech processing. Central to our design is the Input-aware Multi-Level Spike (IMLS) mechanism, which simulates multi-timestep spike firing within a single timestep using an adaptive, input-aware thresholding scheme. IML-Spikeformer further integrates a Reparameterized Spiking Self-Attention (RepSSA) module with a Hierarchical Decay Mask (HDM), forming the HD-RepSSA module. This module enhances the precision of attention maps and enables modeling of multi-scale temporal dependencies in speech signals. Experiments demonstrate that IML-Spikeformer achieves word error rates of 6.0% on AIShell-1 and 3.4% on LibriSpeech-960, comparable to conventional ANN Transformers, while reducing theoretical inference energy consumption by 4.64× and 4.32×, respectively. IML-Spikeformer marks an advance in scalable SNN architectures for large-scale speech processing in terms of both task performance and energy efficiency.