Unveiling the Spatial-temporal Effective Receptive Fields of Spiking Neural Networks

📅 2025-10-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Spiking Neural Networks (SNNs) suffer from limited spatio-temporal effective receptive fields (ST-ERFs) in visual long-sequence modeling, hindering the capture of global spatiotemporal dependencies. This work introduces the first quantitative ST-ERF analysis framework, uncovering fundamental bottlenecks in ST-ERF evolution across existing SNNs, including transformer-inspired architectures. Guided by this analysis, the authors propose two lightweight channel-mixing modules, the multi-layer-perceptron-based mixer (MLPixer) and the splash-and-reconstruct block (SRB), which expand the global spatial receptive field across all timesteps in early network stages. The approach integrates spike-based dynamics with transformer-like structural priors while remaining compatible with event-driven computation. Extensive experiments on the Meta-SDT model family demonstrate substantial improvements on both object detection and semantic segmentation tasks. The implementation is publicly available.

📝 Abstract
Spiking Neural Networks (SNNs) demonstrate significant potential for energy-efficient neuromorphic computing through an event-driven paradigm. While training methods and computational models have greatly advanced, SNNs struggle to achieve competitive performance in visual long-sequence modeling tasks. In artificial neural networks, the effective receptive field (ERF) serves as a valuable tool for analyzing feature extraction capabilities in visual long-sequence modeling. Inspired by this, we introduce the Spatio-Temporal Effective Receptive Field (ST-ERF) to analyze the ERF distributions across various Transformer-based SNNs. Based on the proposed ST-ERF, we reveal that these models struggle to establish a robust global ST-ERF, thereby limiting their visual feature modeling capabilities. To overcome this issue, we propose two novel channel-mixer architectures: the multi-layer-perceptron-based mixer (MLPixer) and the splash-and-reconstruct block (SRB). These architectures enhance the global spatial ERF through all timesteps in the early network stages of Transformer-based SNNs, improving performance on challenging visual long-sequence modeling tasks. Extensive experiments conducted on Meta-SDT variants across object detection and semantic segmentation tasks further validate the effectiveness of our proposed method. Beyond these specific applications, we believe the proposed ST-ERF framework can provide valuable insights for designing and optimizing SNN architectures across a broader range of tasks. The code is available at https://github.com/EricZhang1412/Spatial-temporal-ERF.
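The paper does not spell out its measurement procedure here, but the ST-ERF idea can be illustrated with a toy perturbation experiment: perturb each input pixel at each timestep of a tiny leaky, convolutional stand-in for an SNN, and record how much the centre output unit changes. The model, its leak factor, and all shapes below are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

# Toy, perturbation-based sketch of a spatio-temporal effective receptive
# field (ST-ERF) measurement. The tiny two-conv "membrane" model below is
# an assumption for illustration only.

rng = np.random.default_rng(0)

T, H, W = 4, 9, 9          # timesteps and spatial size
K = 3                      # conv kernel size
w1 = rng.normal(size=(K, K)) / K
w2 = rng.normal(size=(K, K)) / K

def conv2d_valid(x, w):
    # Plain 'valid' 2-D correlation, no padding.
    k = w.shape[0]
    h, ww = x.shape[0] - k + 1, x.shape[1] - k + 1
    out = np.zeros((h, ww))
    for i in range(h):
        for j in range(ww):
            out[i, j] = np.sum(x[i:i + k, j:j + k] * w)
    return out

def model(x):
    # x: (T, H, W). A leaky membrane accumulates conv features over time,
    # mimicking how SNN state carries information across timesteps.
    mem = np.zeros((H - 2 * (K - 1), W - 2 * (K - 1)))
    for t in range(T):
        feat = conv2d_valid(conv2d_valid(x[t], w1), w2)
        mem = 0.5 * mem + feat
    return mem[mem.shape[0] // 2, mem.shape[1] // 2]  # centre output unit

x0 = rng.normal(size=(T, H, W))
base = model(x0)
eps = 1e-4
erf = np.zeros((T, H, W))
for t in range(T):
    for i in range(H):
        for j in range(W):
            xp = x0.copy()
            xp[t, i, j] += eps
            erf[t, i, j] = abs(model(xp) - base) / eps

# Earlier timesteps are attenuated by the leak, and only pixels inside the
# stacked kernels' footprint can influence the centre output at all.
print(erf.sum(axis=(1, 2)))
```

The resulting map makes the two bottlenecks the paper discusses visible even in this toy: spatially, the ERF is bounded by the stacked kernel footprint, and temporally, the leak shrinks the contribution of early timesteps geometrically.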
Problem

Research questions and friction points this paper is trying to address.

SNNs struggle with visual long-sequence modeling performance
Transformer-based SNNs lack robust global spatio-temporal receptive fields
Limited feature extraction capabilities hinder SNN visual tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introducing Spatio-Temporal Effective Receptive Field for SNNs
Proposing MLPixer and SRB channel-mixer architectures
Enhancing global spatial ERF across all timesteps
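To make the channel-mixer idea concrete, the sketch below shows what a cross-timestep channel-mixing module might look like in the spirit of MLPixer: a shared two-layer MLP mixes the channel dimension of binary spike features at every timestep and spatial token. The shapes, the weight sharing, and the Heaviside spike function are assumptions for illustration; the paper's actual MLPixer and SRB designs may differ.

```python
import numpy as np

# Hypothetical sketch of a cross-timestep channel-mixing module.
# All names, shapes, and the firing rule are illustrative assumptions.

rng = np.random.default_rng(1)
T, N, C, C_hidden = 4, 16, 8, 32   # timesteps, tokens, channels, hidden width

W1 = rng.normal(scale=C ** -0.5, size=(C, C_hidden))
W2 = rng.normal(scale=C_hidden ** -0.5, size=(C_hidden, C))

def spike(v, threshold=0.5):
    # Heaviside firing: binary spikes, the event-driven signal SNNs use.
    return (v >= threshold).astype(v.dtype)

def channel_mixer(s):
    # s: (T, N, C) binary spike tensor. The same weights are shared across
    # all timesteps, so channel mixing acts "through all timesteps" without
    # adding recurrent parameters.
    h = spike(s @ W1)          # expand and mix channels, then re-spike
    return spike(h @ W2)       # project back to C channels

s_in = (rng.random((T, N, C)) < 0.3).astype(np.float64)
s_out = channel_mixer(s_in)
print(s_out.shape, s_out.min(), s_out.max())
```

Because the output stays binary, such a module keeps the event-driven, multiplication-sparse character that motivates SNNs in the first place, while still mixing information across every channel at each token and timestep.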
👥 Authors
Jieyuan Zhang
University of Electronic Science and Technology of China
Xiaolong Zhou
Professor, Quzhou University, Quzhou, China
Visual tracking · Gaze estimation · Computer vision
Shuai Wang
University of Electronic Science and Technology of China
Wenjie Wei
University of Electronic Science and Technology of China
Spiking Neural Networks · Neuromorphic Computing · Model Compression · Event-based Vision
Hanwen Liu
University of Electronic Science and Technology of China
Qian Sun
University of Electronic Science and Technology of China
Malu Zhang
University of Electronic Science and Technology of China, Shenzhen Loop Area Institute
Yang Yang
University of Electronic Science and Technology of China
Haizhou Li
The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), China; NUS, Singapore
Automatic Speech Recognition · Speaker Recognition · Language Recognition · Voice Conversion · Machine Translation