🤖 AI Summary
Spiking Vision Transformers (S-ViTs) are difficult to optimize jointly for memory efficiency, accuracy, and energy consumption during both training and inference. This work proposes what the authors describe as the first multidimensional grouped computation framework tailored for S-ViTs, enabling co-optimization across temporal, spatial, and architectural dimensions. The core innovations are an integrate-and-fire (IF) neuron model based on grouped exponential coding (ExpG-IF) and a multiplication-free group-wise spiking self-attention mechanism (GW-SSA), embedded in a hybrid attention-convolution architecture. On multiple benchmarks, the authors report that the approach outperforms existing ANN-to-SNN conversion methods and surrogate-gradient-based STBP techniques while remaining highly energy efficient.
📝 Abstract
Spiking Neural Networks (SNNs) offer superior energy efficiency over Artificial Neural Networks (ANNs). However, they fall short on training and inference metrics when applied to Spiking Vision Transformers (S-ViTs). Existing paradigms, including ANN-SNN Conversion and Spatial-Temporal Backpropagation (STBP), suffer from inherent limitations that preclude concurrent optimization of memory, accuracy, and energy consumption. To address these issues, we propose Ge$^\text{2}$mS-T, a novel architecture that implements grouped computation across the temporal, spatial, and network-structure dimensions. Specifically, we introduce the Grouped-Exponential-Coding-based IF (ExpG-IF) model, which enables lossless conversion with constant training overhead and precise regulation of spike patterns. Additionally, we develop Group-wise Spiking Self-Attention (GW-SSA), which reduces computational complexity via multi-scale token grouping and multiplication-free operations within a hybrid attention-convolution framework. Experiments confirm that our method achieves superior performance with ultra-high energy efficiency on challenging benchmarks. To the best of our knowledge, this is the first work to systematically establish multi-dimensional grouped computation for resolving the triad of memory overhead, learning capability, and energy budget in S-ViTs.
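
To make two ideas from the abstract concrete, the following is a minimal NumPy sketch, not the paper's ExpG-IF or GW-SSA. It assumes a plain integrate-and-fire neuron with reset by subtraction, and a simplified attention that splits tokens into disjoint, equal-sized groups with binary spike inputs. All function names (`if_neuron`, `groupwise_spike_attention`, `attention_score_ops`) and parameters are hypothetical and chosen only for illustration; the exponential coding, multi-scale grouping, and convolutional branches described above are not modeled.

```python
import numpy as np


def if_neuron(currents, threshold=1.0):
    """Plain integrate-and-fire neuron (assumed: reset by subtraction).

    currents: array of shape (T, ...) holding the input at each time step.
    Returns a binary spike array of the same shape.
    """
    v = np.zeros_like(currents[0], dtype=float)  # membrane potential
    spikes = []
    for x in currents:
        v = v + x                                  # integrate
        s = (v >= threshold).astype(np.int64)      # fire when threshold is reached
        v = v - s * threshold                      # subtract threshold after a spike
        spikes.append(s)
    return np.stack(spikes)


def groupwise_spike_attention(q, k, v, group_size):
    """Attention restricted to disjoint token groups (illustrative only).

    q, k, v: binary integer arrays of shape (N, d).
    With binary inputs every elementwise product is 0 or 1, so the score
    matrix holds counts of coincident spikes; NumPy still uses matmul here.
    """
    n, d = q.shape
    assert n % group_size == 0, "N must be divisible by group_size"
    g = n // group_size
    qg = q.reshape(g, group_size, d)
    kg = k.reshape(g, group_size, d)
    vg = v.reshape(g, group_size, d)
    scores = qg @ kg.transpose(0, 2, 1)  # (g, group_size, group_size)
    out = scores @ vg                    # (g, group_size, d)
    return out.reshape(n, d)


def attention_score_ops(n, d, group_size):
    """Number of accumulate operations needed to form the score matrices."""
    return (n // group_size) * group_size * group_size * d
```

Under these assumptions, grouping N tokens into groups of size g lowers the cost of forming attention scores from N²·d to N·g·d operations; for example, `attention_score_ops(16, 8, 4)` is a quarter of `attention_score_ops(16, 8, 16)`.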