🤖 AI Summary
To address the sparsity and asynchrony of event camera data, along with the high computational cost and parameter complexity of existing methods, this paper proposes a lightweight spatiotemporal attention learning framework. For the first time, the Convolutional Block Attention Module (CBAM) is integrated into a VGG backbone, enabling effective modeling of spatiotemporal dependencies in event streams without pretraining or heavy data augmentation. This design brings the parameter count 2.3% below that of the original VGG, significantly improving deployment feasibility. Evaluated on the standard benchmarks CIFAR10-DVS and N-Caltech101, the method achieves 71.3% and 72.4% Top-1 accuracy without pretraining, outperforming ResNet-based approaches such as MVF-Net; with pretraining, it reaches 76.4% on CIFAR10-DVS. Key contributions include: (i) CBAM-VGG, a lightweight adaptation tailored to event streams; (ii) strong generalization without pretraining; and (iii) an end-to-end recognition framework that balances accuracy and efficiency.
📝 Abstract
Event-based vision sensors, inspired by biological neural systems, asynchronously capture local pixel-level intensity changes as a sparse event stream containing position, polarity, and timestamp information. These neuromorphic sensors offer significant advantages in dynamic range, latency, and power efficiency. Their working principle inherently avoids traditional camera limitations such as motion blur and redundant background information, making them particularly suitable for dynamic vision tasks. While recent works have proposed increasingly complex event-based architectures, their computational overhead and parameter complexity limit practical deployment. This paper presents a novel spatiotemporal learning framework for event-based object recognition, built on a VGG network enhanced with the Convolutional Block Attention Module (CBAM). Our approach achieves performance comparable to state-of-the-art ResNet-based methods while reducing the parameter count by 2.3% relative to the original VGG model. Specifically, it outperforms ResNet-based methods such as MVF-Net, achieving the highest Top-1 accuracy of 76.4% (pretrained) and 71.3% (without pretraining) on CIFAR10-DVS, and 72.4% (without pretraining) on N-Caltech101. These results highlight the robustness of our method when pretrained weights are unavailable, making it suitable for scenarios where transfer learning is not an option. Moreover, our approach reduces reliance on data augmentation. Experimental results on standard event-based datasets demonstrate the framework's efficiency and effectiveness for real-world applications.
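To make the attention mechanism concrete, the sketch below shows the standard CBAM computation the paper builds on: channel attention (shared MLP over average- and max-pooled channel descriptors) followed by spatial attention (channel-wise pooling plus a 7×7 aggregation). This is a minimal NumPy illustration, not the paper's implementation: the tensor shapes are toy examples, and uniform weights stand in for the learned 7×7 spatial convolution.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x, w1, w2):
    """Channel attention: shared two-layer MLP over avg- and max-pooled
    channel descriptors. x: (C, H, W); w1: (C//r, C); w2: (C, C//r)."""
    avg = x.mean(axis=(1, 2))                       # (C,)
    mx = x.max(axis=(1, 2))                         # (C,)
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)    # ReLU hidden layer
    return sigmoid(mlp(avg) + mlp(mx))              # (C,) in (0, 1)

def spatial_attention(x, k=7):
    """Spatial attention: pool along channels, then a k x k aggregation.
    Uniform weights stand in for the learned 7x7 convolution."""
    f = np.stack([x.mean(axis=0), x.max(axis=0)])   # (2, H, W)
    p = k // 2
    fp = np.pad(f, ((0, 0), (p, p), (p, p)))        # zero-pad spatially
    H, W = x.shape[1:]
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = fp[:, i:i + k, j:j + k].sum() / (2 * k * k)
    return sigmoid(out)                             # (H, W) in (0, 1)

def cbam(x, w1, w2):
    """Apply channel attention, then spatial attention, to a feature map."""
    x = x * channel_attention(x, w1, w2)[:, None, None]
    return x * spatial_attention(x)[None, :, :]
```

In the paper's setting, a block like this would be inserted after convolutional stages of the VGG backbone, reweighting event-frame features along channel and spatial axes while adding only a small number of parameters.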