🤖 AI Summary
Existing Transformer-based spiking neural networks (SNNs) lose local features because of max pooling and incur high computational redundancy in global self-attention, both of which conflict with the inherent sparsity and energy efficiency of SNNs. To address these limitations, this work proposes LSFormer, the first Transformer-based SNN architecture to incorporate a local dilated window mechanism. LSFormer combines Spiking Response Pooling (SPooling) with a Local Structure-Aware Spiking Self-Attention (LS-SSA) module, capturing both fine-grained local details and long-range dependencies while preserving network sparsity. The proposed method achieves state-of-the-art performance on benchmark datasets, surpassing the current best approaches by 4.3% and 8.6% in top-1 accuracy on Tiny-ImageNet and N-CALTECH101, respectively.
📝 Abstract
Transformer-based Spiking Neural Networks (SNNs) integrate SNNs with global self-attention and have demonstrated impressive performance. However, existing Transformer-based SNNs suffer from two fundamental limitations. First, they typically employ max pooling layers to reduce the size of feature maps, but max pooling captures only the strongest response and fails to comprehensively preserve representative regional features. Second, global self-attention computes interactions among all features, resulting in computational redundancy and quadratic computational complexity, which conflicts with the sparse and energy-efficient characteristics of SNNs. To address these challenges, we develop the Local Structure-Aware Spiking Transformer (LSFormer), a novel Transformer-based SNN that incorporates Spiking Response Pooling (SPooling) and Local Structure-Aware Spiking Self-Attention (LS-SSA). For the first time, LSFormer leverages a local dilated window mechanism to capture both local details and long-range dependencies. Experimental results demonstrate that LSFormer achieves state-of-the-art performance compared to existing advanced Transformer-based SNNs. Notably, on the more challenging static dataset Tiny-ImageNet and the neuromorphic dataset N-CALTECH101, LSFormer substantially outperforms state-of-the-art baselines by 4.3% and 8.6% in top-1 classification accuracy, respectively. These results highlight the potential of LSFormer to advance energy-efficient spiking models toward practical deployment in large-scale vision applications.
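To make the complexity argument concrete, the following minimal sketch counts query-key interactions for global self-attention versus a generic local dilated window on a small feature map. It is not the LS-SSA implementation: the function names (`global_pairs`, `dilated_window_neighbors`, `local_pairs`) are hypothetical, and the window size of 3 and dilation of 2 are assumed values chosen for illustration, not settings taken from the paper.

```python
def global_pairs(h, w):
    """Query-key pairs when every position attends to every position."""
    n = h * w
    return n * n  # quadratic in the number of tokens


def dilated_window_neighbors(row, col, h, w, window=3, dilation=2):
    """In-bounds positions that a query at (row, col) attends to.

    Assumption: a square window whose taps are spaced `dilation` apart,
    with out-of-bounds taps simply dropped (no padding).
    """
    half = window // 2
    neighbors = []
    for dr in range(-half, half + 1):
        for dc in range(-half, half + 1):
            r = row + dr * dilation
            c = col + dc * dilation
            if 0 <= r < h and 0 <= c < w:
                neighbors.append((r, c))
    return neighbors


def local_pairs(h, w, window=3, dilation=2):
    """Total query-key pairs when each position uses a dilated window."""
    return sum(
        len(dilated_window_neighbors(r, c, h, w, window, dilation))
        for r in range(h)
        for c in range(w)
    )


if __name__ == "__main__":
    h, w = 8, 8
    print(global_pairs(h, w))  # 4096
    print(local_pairs(h, w))   # 400
```

On this 8x8 map the windowed variant evaluates 400 pairs instead of 4096. Its cost grows linearly with the number of positions (at most `window * window` keys per query), while the dilation lets each query reach positions up to two steps away in each direction with only nine taps.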