MSVIT: Improving Spiking Vision Transformer Using Multi-scale Attention Fusion

📅 2025-05-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing spiking neural network (SNN)–vision Transformer hybrids suffer from limited multi-scale feature extraction capability. To address this, we propose the Multi-Scale Spiking Vision Transformer (MSVIT), whose core innovation is the first-ever Multi-Scale Spiking Self-Attention (MSSA) mechanism. MSSA models event-driven cross-scale spiking responses, effectively alleviating the feature representation bottleneck imposed by the constrained spatiotemporal resolution inherent in SNNs. MSVIT deeply integrates spike dynamics with hierarchical attention, preserving the energy efficiency of SNNs while substantially enhancing representational capacity. Evaluated on ImageNet, CIFAR-10, and CIFAR-100, MSVIT significantly outperforms prior SNN–Transformer approaches, establishing new state-of-the-art performance in the spiking vision Transformer domain.
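The paper defines MSSA precisely; as a rough, hypothetical sketch of the general idea only (binary spike activations, softmax-free attention computed over token subsets at several strides, fused by summation; the function names, thresholds, and strides here are illustrative assumptions, not the paper's formulation), one might write:

```python
import numpy as np

rng = np.random.default_rng(0)

def spike(x, threshold=0.5):
    # Heaviside step: real-valued input -> binary spikes
    return (x > threshold).astype(np.float32)

def spiking_attention(q, k, v):
    # Spike-driven attention: Q, K, V are binary, so Q K^T V is pure
    # accumulation (no softmax, no floating-point multiplications)
    scores = q @ k.T
    return spike(scores @ v, threshold=0.5 * q.shape[0])

def mssa(x, scales=(1, 2, 4)):
    # Hypothetical multi-scale fusion: attend over token subsets
    # sampled at several strides, then sum the per-scale outputs
    out = np.zeros_like(x)
    for s in scales:
        xs = spike(x[::s])            # spikes at this scale
        out[::s] += spiking_attention(xs, xs, xs)
    return spike(out)

tokens = rng.random((8, 16))          # 8 tokens, 16-dim features
y = mssa(tokens)                      # binary output, shape (8, 16)
```

In a full model each scale would apply learned spiking Q/K/V projections (and time steps) before attention; the sketch only shows how per-scale spike responses can be fused without leaving the binary, accumulation-only regime that keeps SNNs energy-efficient.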

📝 Abstract
The combination of Spiking Neural Networks (SNNs) with Vision Transformer architectures has attracted significant attention due to its great potential for energy-efficient, high-performance computing paradigms. However, a substantial performance gap still exists between SNN-based and ANN-based Transformer architectures. While existing methods propose spiking self-attention mechanisms that combine successfully with SNNs, the overall architectures they propose suffer from a bottleneck in effectively extracting features at different image scales. In this paper, we address this issue and propose MSVIT, a novel spike-driven Transformer architecture that is the first to use multi-scale spiking attention (MSSA) to enrich the capability of spiking attention blocks. We validate our approach across several mainstream datasets. The experimental results show that MSVIT outperforms existing SNN-based models, positioning itself as a state-of-the-art solution among SNN–Transformer architectures. The code is available at https://github.com/Nanhu-AI-Lab/MSViT.
Problem

Research questions and friction points this paper is trying to address.

Bridging the performance gap between SNN-based and ANN-based Transformers
Enhancing multi-scale feature extraction in spiking attention
Improving the energy efficiency of vision Transformers with SNNs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-scale spiking attention (MSSA) for feature extraction
Novel spike-driven Transformer architecture (MSVIT)
State-of-the-art performance among SNN–Transformer architectures