IML-Spikeformer: Input-aware Multi-Level Spiking Transformer for Speech Processing

📅 2025-07-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenges of inferior performance, high training overhead, and lack of architecture specialization in Spiking Neural Networks (SNNs) for large-scale speech recognition, this paper proposes IML-Spikeformer—a hybrid SNN-Transformer architecture integrating Input-aware Multi-level spiking mechanisms and Reparameterized Spiking Self-Attention (RepSSA). Key innovations include: (1) an input-aware adaptive threshold mechanism enabling multi-step spike simulation within a single time step; and (2) the HD-RepSSA module enhanced with Hierarchical Decay Masking (HDM), improving multi-scale temporal modeling accuracy while substantially reducing energy consumption. Evaluated on AIShell-1 and LibriSpeech-960, IML-Spikeformer achieves word error rates of 6.0% and 3.4%, respectively—on par with state-of-the-art ANN-based Transformers—while delivering theoretical inference energy reductions of 4.64× and 4.32×. This work marks the first demonstration of jointly achieving high accuracy and high energy efficiency in large-scale speech SNNs.
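The core trick of the IMLS mechanism is to replace multi-timestep binary firing with a single-timestep, multi-level spike whose firing threshold adapts to the input. The paper's exact formulation is not reproduced here; the following is a minimal sketch under assumed forms — the function name `iml_spike`, the threshold rule (base threshold plus a term scaled by mean input magnitude), and the fixed maximum level are all illustrative assumptions.

```python
import numpy as np

def iml_spike(membrane, x, base_threshold=1.0, alpha=0.5, max_level=4):
    """Input-aware multi-level spike (illustrative sketch).

    Emits an integer spike level in one timestep, approximating what a
    binary neuron would fire over several timesteps. The threshold form
    below is a hypothetical stand-in for the paper's adaptive scheme.
    """
    # Input-aware threshold: grows with the mean input magnitude (assumption)
    theta = base_threshold + alpha * np.mean(np.abs(x))
    # Spike level = number of thresholds the membrane potential crosses,
    # clipped to max_level (playing the role of T simulated timesteps)
    level = np.clip(np.floor(membrane / theta), 0, max_level)
    # Soft reset: subtract the charge carried away by the emitted spikes
    membrane_after = membrane - level * theta
    return level.astype(int), membrane_after
```

With `base_threshold=1.0`, `alpha=0.5`, and unit inputs, a membrane potential of 2.3 crosses one effective threshold of 1.5 and fires level 1, while 9.0 saturates at the maximum level of 4 — one forward pass stands in for several binary timesteps.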

📝 Abstract
Spiking Neural Networks (SNNs), inspired by biological neural mechanisms, represent a promising neuromorphic computing paradigm that offers energy-efficient alternatives to traditional Artificial Neural Networks (ANNs). Despite proven effectiveness, SNN architectures have struggled to achieve competitive performance on large-scale speech processing tasks. Two key challenges hinder progress: (1) the high computational overhead during training caused by multi-timestep spike firing, and (2) the absence of large-scale SNN architectures tailored to speech processing tasks. To overcome these issues, we introduce the Input-aware Multi-Level Spikeformer, i.e., IML-Spikeformer, a spiking Transformer architecture specifically designed for large-scale speech processing. Central to our design is the Input-aware Multi-Level Spike (IMLS) mechanism, which simulates multi-timestep spike firing within a single timestep using an adaptive, input-aware thresholding scheme. IML-Spikeformer further integrates a Reparameterized Spiking Self-Attention (RepSSA) module with a Hierarchical Decay Mask (HDM), forming the HD-RepSSA module. This module enhances the precision of attention maps and enables modeling of multi-scale temporal dependencies in speech signals. Experiments demonstrate that IML-Spikeformer achieves word error rates of 6.0% on AIShell-1 and 3.4% on LibriSpeech-960, comparable to conventional ANN Transformers, while reducing theoretical inference energy consumption by 4.64× and 4.32×, respectively. IML-Spikeformer marks an advance in scalable SNN architectures for large-scale speech processing in both task performance and energy efficiency.
Problem

Research questions and friction points this paper is trying to address.

High computational overhead in SNN training due to multi-timestep spikes
Lack of large-scale SNN architectures for speech processing tasks
Need for energy-efficient yet competitive speech processing models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Input-aware Multi-Level Spike (IMLS) mechanism
Reparameterized Spiking Self-Attention (RepSSA) module
Hierarchical Decay Mask (HDM) for temporal dependencies
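The HDM idea in the list above — down-weighting attention between temporally distant frames, with different decay rates covering different temporal scales — can be sketched as follows. This is not the paper's implementation: the per-head decay rates, the symmetric `gamma^|i-j|` form, and the multiplicative application to attention scores are all illustrative assumptions.

```python
import numpy as np

def hierarchical_decay_masks(seq_len, decays=(0.9, 0.95, 0.99)):
    """Build one decay mask per attention head (illustrative sketch).

    Entry [h, i, j] = decays[h] ** |i - j| shrinks attention between
    distant frames; faster decay captures local structure, slower decay
    captures long-range dependencies. Decay rates are made-up examples.
    """
    idx = np.arange(seq_len)
    dist = np.abs(idx[:, None] - idx[None, :])      # |i - j| distance matrix
    return np.stack([g ** dist for g in decays])    # shape (heads, T, T)

def apply_decay_mask(scores, masks):
    """Apply the per-head decay masks multiplicatively to raw attention
    scores of shape (heads, T, T)."""
    return scores * masks
```

Each head thus attends over a different effective temporal window, which is one plausible way to realize the multi-scale temporal modeling the summary attributes to HD-RepSSA.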