🤖 AI Summary
To address the high latency and computational overhead of Spiking Vision Transformers (SNN-ViTs) arising from multi-timestep inference, this paper proposes the first spatiotemporal co-adaptive computation framework. Unlike conventional Adaptive Computation Time (ACT) methods—which suffer from invalid temporal similarity assumptions and architectural rigidity in SNNs—the approach introduces an Integrated Spike Patch Splitting (I-SPS) module to enhance temporal stability and a two-dimensional Adaptive Spiking Self-Attention (A-SSA) mechanism enabling joint spatial-temporal token pruning. Evaluated on CIFAR-10, CIFAR-100, and ImageNet, the method reduces energy consumption by up to 45.9%, 43.8%, and 30.1%, respectively, while surpassing state-of-the-art SNN-ViT models in classification accuracy. This work constitutes the first successful instantiation and empirical validation of the ACT principle in spiking ViT architectures.
📝 Abstract
Spiking neural networks (SNNs) offer energy efficiency over artificial neural networks (ANNs) but suffer from high latency and computational overhead due to their multi-timestep operational nature. Various dynamic computation methods have been developed to mitigate this by targeting spatial, temporal, or architecture-specific redundancies, but they remain fragmented. Although adaptive computation time (ACT) offers a robust foundation for a unified approach, its application to SNN-based vision Transformers (ViTs) is hindered by two core issues: the violation of its temporal similarity prerequisite and a static architecture fundamentally unsuited to its principles. To address these challenges, we propose STAS (Spatio-Temporal Adaptive computation time for Spiking transformers), a framework that co-designs the static architecture and the dynamic computation policy. STAS introduces an integrated spike patch splitting (I-SPS) module to establish temporal stability by creating a unified input representation, thereby solving the architectural problem of temporal dissimilarity. This stability, in turn, allows our adaptive spiking self-attention (A-SSA) module to perform two-dimensional token pruning across both the spatial and temporal axes. Implemented on spiking Transformer architectures and validated on CIFAR-10, CIFAR-100, and ImageNet, STAS reduces energy consumption by up to 45.9%, 43.8%, and 30.1%, respectively, while simultaneously improving accuracy over SOTA models.
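To make the two-dimensional pruning idea concrete, the sketch below shows a generic ACT-style halting rule applied jointly over timesteps and spatial tokens. This is an illustration of the classic ACT mechanism (cumulative halting scores against a 1−ε threshold), not the paper's actual A-SSA implementation; the array shapes, the `eps` value, and the per-layer halting probabilities are all assumptions for the sake of the example.

```python
import numpy as np

def act_token_pruning(halt_probs, eps=0.01):
    """Generic ACT-style halting over a stack of transformer layers.

    halt_probs: array of shape (L, T, N) -- hypothetical per-layer
    halting probabilities for each (timestep, spatial token) pair,
    so pruning can act on both the temporal (T) and spatial (N) axes.
    Returns a boolean mask of shape (L, T, N): True where the token
    is still active (not yet halted) when layer l runs.
    """
    L, T, N = halt_probs.shape
    cum = np.zeros((T, N))                 # cumulative halting score per token
    active = np.zeros((L, T, N), dtype=bool)
    for l in range(L):
        active[l] = cum < 1.0 - eps        # token keeps computing below threshold
        cum = cum + halt_probs[l] * active[l]  # accumulate only while active
    return active
```

A token whose cumulative halting score crosses 1−ε early is skipped by all deeper layers, which is where the compute and energy savings come from; in STAS this decision is made jointly over spatial position and timestep rather than per-axis.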