🤖 AI Summary
Spiking Neural Networks (SNNs) face significant challenges in scaling self-supervised learning to large unlabeled datasets, primarily because the discrete, non-differentiable nature of spikes breaks the cross-view gradient consistency that contrastive and consistency-based objectives rely on.
Method: We propose a dual-path neuron architecture that jointly integrates a differentiable surrogate branch—enabling gradient propagation during training—and a genuine spiking branch—preserving full spike dynamics during inference. Coupled with cross-view and temporal alignment losses, this design enhances inter-sample representation consistency within both convolutional and Transformer-based SNNs.
Contribution/Results: This work achieves the first full self-supervised pretraining of SNNs at ImageNet scale. Our Spikformer-16-512 model attains 70.1% top-1 accuracy on ImageNet-1K, demonstrating the feasibility of high-capacity SNNs for unsupervised learning at modern scales.
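The dual-path neuron described above can be illustrated with a minimal sketch: a genuine spiking branch applies a Heaviside threshold (used at inference), while a surrogate branch supplies a smooth pseudo-derivative for the backward pass. The leaky integrate-and-fire dynamics, sigmoid surrogate, and constants (`tau`, `v_th`, `beta`) are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def dual_path_lif(x_seq, tau=2.0, v_th=1.0, beta=4.0):
    """Sketch of a dual-path leaky integrate-and-fire neuron.

    x_seq: (T, N) input currents over T timesteps. Returns the binary
    spike trains (spiking branch) and the sigmoid-based surrogate
    derivatives that the differentiable branch would propagate in
    place of the Heaviside derivative during training.
    """
    v = np.zeros_like(x_seq[0])
    spikes, surrogate_grads = [], []
    for x_t in x_seq:
        v = v + (x_t - v) / tau             # leaky membrane integration
        s = (v >= v_th).astype(x_t.dtype)   # spiking branch: Heaviside step
        sig = 1.0 / (1.0 + np.exp(-beta * (v - v_th)))
        surrogate_grads.append(beta * sig * (1.0 - sig))  # smooth pseudo-grad
        spikes.append(s)
        v = v * (1.0 - s)                   # hard reset after a spike
    return np.stack(spikes), np.stack(surrogate_grads)
```

Because the surrogate derivative is computed from the same membrane potential that drives the spikes, the training branch can be dropped at inference and only the binary spike path remains.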
📝 Abstract
Spiking neural networks (SNNs) exhibit temporal, sparse, and event-driven dynamics that make them appealing for efficient inference. However, extending these models to self-supervised regimes remains challenging because the discontinuities introduced by spikes break the cross-view gradient correspondences required by contrastive and consistency-driven objectives. This work introduces a training paradigm that enables large SNN architectures to be optimized without labeled data. We formulate a dual-path neuron in which a spike-generating process is paired with a differentiable surrogate branch, allowing gradients to propagate across augmented inputs while preserving a fully spiking implementation at inference. In addition, we propose temporal alignment objectives that enforce representational coherence both across spike timesteps and between augmented views. Using convolutional and transformer-style SNN backbones, we demonstrate ImageNet-scale self-supervised pretraining and strong transfer to classification, detection, and segmentation benchmarks. Our best model, a fully self-supervised Spikformer-16-512, achieves 70.1% top-1 accuracy on ImageNet-1K, demonstrating that unlabeled learning in high-capacity SNNs is feasible at modern scale.
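The two alignment objectives in the abstract can be sketched as cosine-similarity losses over per-timestep features: a cross-view term that pulls time-averaged embeddings of two augmentations together, and a temporal term that pulls consecutive timesteps of each view together. The exact weighting and normalization here are assumptions for illustration, not the paper's loss.

```python
import numpy as np

def alignment_losses(z_a, z_b, eps=1e-8):
    """Sketch of cross-view and temporal alignment objectives.

    z_a, z_b: (T, D) feature sequences for the same image under two
    augmentations, one vector per spike timestep. Returns
    (cross_view_loss, temporal_loss), both zero when features are
    perfectly aligned.
    """
    def cos(u, v):
        return np.sum(u * v, axis=-1) / (
            np.linalg.norm(u, axis=-1) * np.linalg.norm(v, axis=-1) + eps)

    # Cross-view: align the time-averaged representations of both views.
    cross_view = 1.0 - cos(z_a.mean(axis=0), z_b.mean(axis=0))
    # Temporal: align each timestep with its predecessor, per view.
    temporal = np.mean(
        [1.0 - cos(z[1:], z[:-1]).mean() for z in (z_a, z_b)])
    return cross_view, temporal
```

Both terms are differentiable through the surrogate branch during training, which is what lets these consistency signals reach the spiking backbone's weights.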