🤖 AI Summary
This work addresses the scarcity of large-scale annotated data in event-based vision, which hinders the application of spiking neural networks in few-shot scenarios. To overcome this, the authors propose SpikeCLR, a framework that, for the first time, brings contrastive self-supervised learning to event-based spiking vision. The method incorporates a spatio-temporal-polarity augmentation strategy tailored to event data, enabling robust representation learning from unlabeled streams. The pretrained network is subsequently fine-tuned to improve performance in both few-shot and semi-supervised settings. Experimental results on benchmarks such as CIFAR10-DVS and N-Caltech101 show that SpikeCLR outperforms fully supervised models while using significantly fewer labeled samples, validating the generalization capability and transferability of the learned representations.
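To make the augmentation idea concrete, below is a minimal sketch of what a spatio-temporal-polarity augmentation over raw events could look like. It assumes events are stored as parallel `(x, y, t, p)` arrays; the transform choices, parameter ranges, and function name are illustrative assumptions, not the paper's exact strategy.

```python
# Illustrative sketch (not the paper's implementation) of augmenting a raw event
# stream along spatial, temporal, and polarity axes.
import numpy as np

def augment_events(x, y, t, p, width, height, rng=None):
    """Randomly transform an event stream given as parallel arrays.

    x, y : pixel coordinates; t : timestamps; p : polarity in {0, 1}.
    """
    rng = rng or np.random.default_rng()
    x, y, t, p = x.copy(), y.copy(), t.copy(), p.copy()

    # Spatial: random horizontal flip plus a small integer translation.
    if rng.random() < 0.5:
        x = width - 1 - x
    dx, dy = rng.integers(-10, 11, size=2)
    x, y = x + dx, y + dy

    # Temporal: random time scaling and per-event timestamp jitter.
    t = t * rng.uniform(0.9, 1.1) + rng.normal(0.0, 100.0, size=t.shape)

    # Polarity: occasionally swap ON/OFF events.
    if rng.random() < 0.1:
        p = 1 - p

    # Discard events pushed outside the sensor plane by the translation.
    keep = (x >= 0) & (x < width) & (y >= 0) & (y < height)
    return x[keep], y[keep], t[keep], p[keep]
```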
📝 Abstract
Event-based vision sensors provide significant advantages for high-speed perception, including microsecond temporal resolution, high dynamic range, and low power consumption. When combined with Spiking Neural Networks (SNNs), they can be deployed on neuromorphic hardware, enabling energy-efficient applications on embedded systems. However, this potential is severely limited by the scarcity of the large-scale labeled datasets required to train such models effectively. In this work, we introduce SpikeCLR, a contrastive self-supervised learning framework that enables SNNs to learn robust visual representations from unlabeled event data. We adapt prior frame-based methods to the spiking domain using surrogate gradient training and introduce a suite of event-specific augmentations built on spatial, temporal, and polarity transformations. Through extensive experiments on the CIFAR10-DVS, N-Caltech101, N-MNIST, and DVS-Gesture benchmarks, we demonstrate that self-supervised pretraining followed by fine-tuning outperforms purely supervised learning in low-data regimes, achieving consistent gains in few-shot and semi-supervised settings. Our ablation studies reveal that combining spatial and temporal augmentations is critical for learning effective spatio-temporal invariances in event data. We further show that the learned representations transfer across datasets, contributing to the development of powerful event-based models in label-scarce settings.
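For readers unfamiliar with the two ingredients named above, the following PyTorch sketch shows (i) a surrogate-gradient spike function of the kind used to train SNNs, and (ii) a SimCLR-style NT-Xent contrastive loss over embeddings of two augmented views. The surrogate shape, temperature, and any encoder producing the embeddings are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch, assuming a fast-sigmoid surrogate and a SimCLR-style objective;
# these choices are illustrative, not necessarily those of SpikeCLR.
import torch
import torch.nn.functional as F

class SurrogateSpike(torch.autograd.Function):
    """Heaviside spike in the forward pass, smooth surrogate in the backward pass."""
    @staticmethod
    def forward(ctx, membrane_potential):
        ctx.save_for_backward(membrane_potential)
        return (membrane_potential > 0).float()

    @staticmethod
    def backward(ctx, grad_output):
        (v,) = ctx.saved_tensors
        # Derivative of a fast sigmoid, used in place of the Heaviside's zero gradient.
        return grad_output / (1.0 + 10.0 * v.abs()) ** 2

def nt_xent_loss(z1, z2, temperature=0.5):
    """Contrastive loss: embeddings of two views of the same sample attract."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)        # (2N, D), unit norm
    sim = z @ z.t() / temperature                              # cosine similarities
    n = z1.shape[0]
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))                 # exclude self-pairs
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```

In practice, `z1` and `z2` would be the projected outputs of the spiking encoder applied to two independently augmented versions of the same event sample, so the loss pulls matching views together while pushing apart all other samples in the batch.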