Ge$^\text{2}$mS-T: Multi-Dimensional Grouping for Ultra-High Energy Efficiency in Spiking Transformer

📅 2026-04-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Spiking Vision Transformers (S-ViTs) face significant challenges in jointly optimizing memory efficiency, accuracy, and energy consumption during both training and inference. This work proposes the first multidimensional grouped computation framework tailored for S-ViTs, enabling efficient co-optimization across temporal, spatial, and architectural dimensions. The core innovations include an integrate-and-fire (IF) neuron model based on grouped exponential encoding (ExpG-IF) and a multiplication-free intra-group self-attention mechanism (GW-SSA), further enhanced by a hybrid attention-convolution architecture. Evaluated on multiple benchmarks, the proposed approach substantially outperforms existing ANN-to-SNN conversion methods and surrogate gradient-based STBP techniques, achieving state-of-the-art performance with exceptional energy efficiency.

📝 Abstract

Spiking Neural Networks (SNNs) offer superior energy efficiency over Artificial Neural Networks (ANNs). However, they fall significantly short on training and inference metrics when applied to Spiking Vision Transformers (S-ViTs). Existing paradigms, including ANN-SNN Conversion and Spatial-Temporal Backpropagation (STBP), suffer from inherent limitations that preclude concurrent optimization of memory, accuracy, and energy consumption. To address these issues, we propose Ge$^\text{2}$mS-T, a novel architecture that implements grouped computation across the temporal, spatial, and network-structure dimensions. Specifically, we introduce the Grouped-Exponential-Coding-based IF (ExpG-IF) model, which enables lossless conversion with constant training overhead and precise regulation of spike patterns. Additionally, we develop Group-wise Spiking Self-Attention (GW-SSA), which reduces computational complexity via multi-scale token grouping and multiplication-free operations within a hybrid attention-convolution framework. Experiments confirm that our method achieves superior performance with ultra-high energy efficiency on challenging benchmarks. To the best of our knowledge, this is the first work to systematically establish multi-dimensional grouped computation for resolving the triad of memory overhead, learning capability, and energy budget in S-ViTs.

Problem

Research questions and friction points this paper is trying to address.

Spiking Vision Transformers

energy efficiency

memory overhead

learning capability

SNNs

Innovation

Methods, ideas, or system contributions that make the work stand out.

Spiking Transformer

Multi-Dimensional Grouping

Energy Efficiency