🤖 AI Summary
Existing spiking neural network (SNN)-based Transformers focus on single-frame images, limiting their ability to efficiently model spatiotemporal dynamics in videos. This work introduces SpikeVideoFormer, the first spike-driven Transformer for video understanding, overcoming the single-frame constraint while achieving linear temporal complexity $\mathcal{O}(T)$. Its core innovation is the theoretically grounded Spike-Driven Hamming Attention (SDHA), an efficient, lightweight spatiotemporal attention mechanism designed explicitly for spike sequences, from which the work derives an optimal spike-based space-time attention scheme. The model combines event-driven feature extraction with Hamming-distance-based attention computation. It achieves state-of-the-art performance among SNNs on video classification, human pose tracking, and semantic segmentation, outperforming prior SNN methods by over 15% on the latter two tasks while matching the accuracy of recent artificial neural networks (ANNs). Moreover, it delivers 16×, 10×, and 5× energy-efficiency gains on these respective tasks.
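The key idea behind Hamming attention is that, for binary spike features, similarity can be scored by the number of matching bits (i.e., the feature dimension minus the Hamming distance), which is computable with plain matrix products. The following is a minimal NumPy sketch of this idea on toy binary queries and keys; the function name, shapes, and normalization are illustrative assumptions, not the paper's actual SDHA implementation.

```python
import numpy as np

def hamming_attention(Q, K, V):
    """Toy Hamming-similarity attention over binary spike features.

    Q, K: (N, d) matrices of 0/1 spikes; V: (N, dv) values.
    Similarity(i, j) = number of matching bits between Q[i] and K[j],
    which equals d - HammingDistance(Q[i], K[j]).
    Illustrative sketch only; not the paper's SDHA.
    """
    N, d = Q.shape
    # Matching bits = agreements on 1s plus agreements on 0s.
    match = Q @ K.T + (1 - Q) @ (1 - K.T)          # (N, N), equals d - Hamming
    scores = match / d                              # scale scores into [0, 1]
    attn = scores / scores.sum(axis=1, keepdims=True)  # row-normalize weights
    return attn @ V

rng = np.random.default_rng(0)
Q = (rng.random((4, 8)) > 0.5).astype(float)  # binary spike queries
K = (rng.random((4, 8)) > 0.5).astype(float)  # binary spike keys
V = rng.random((4, 3))
out = hamming_attention(Q, K, V)              # shape (4, 3)
```

Because the matching-bit count reduces to matrix multiplication over 0/1 tensors, such a score is cheap on spike data; the paper's linear $\mathcal{O}(T)$ temporal complexity additionally depends on its specific space-time attention factorization, which this toy per-frame sketch does not show.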
📝 Abstract
Spiking Neural Networks (SNNs) have shown performance competitive with Artificial Neural Networks (ANNs) on various vision tasks while offering superior energy efficiency. However, existing SNN-based Transformers primarily focus on single-image tasks, emphasizing spatial features and failing to fully exploit SNNs' efficiency in video-based vision tasks. In this paper, we introduce SpikeVideoFormer, an efficient spike-driven video Transformer featuring linear temporal complexity $\mathcal{O}(T)$. Specifically, we design a spike-driven Hamming attention (SDHA), which provides a theoretically guided adaptation from traditional real-valued attention to spike-driven attention. Building on SDHA, we further analyze various spike-driven space-time attention designs and identify an optimal scheme that delivers appealing performance for video tasks while maintaining only linear temporal complexity. The generalization ability and efficiency of our model are demonstrated across diverse downstream video tasks, including classification, human pose tracking, and semantic segmentation. Empirical results show our method achieves state-of-the-art (SOTA) performance compared to existing SNN approaches, with over 15% improvement on the latter two tasks. Additionally, it matches the performance of recent ANN-based methods while offering significant efficiency gains of $\times 16$, $\times 10$, and $\times 5$ on the three tasks. Code: https://github.com/JimmyZou/SpikeVideoFormer