🤖 AI Summary
To address the substantial modality gap between static RGB images and sparse event streams from dynamic vision sensors (DVS), and the inefficient knowledge transfer that results, this paper proposes Time-step Mixup Knowledge Transfer (TMKT), a cross-modal training framework. TMKT introduces three key components: (1) a probabilistic Time-step Mixup (TSM) strategy that exploits the asynchronous processing of spiking neural networks (SNNs) to interpolate RGB and DVS inputs at different time steps, forming a smooth learning curriculum within each event sequence; (2) Modality-Aware Guidance (MAG) for per-frame source supervision; and (3) Mixup Ratio Perception (MRP) for sequence-level mixing-ratio estimation, which together align temporal features with the mixing schedule and reduce gradient variance. Evaluated on multiple benchmark datasets and mainstream SNN backbones, TMKT consistently improves classification accuracy, demonstrating its effectiveness, robustness, and generalization across modalities and architectures.
📝 Abstract
The integration of event cameras and spiking neural networks (SNNs) promises energy-efficient visual intelligence, yet scarce event data and the sparsity of DVS outputs hinder effective training. Prior knowledge transfer from RGB to DVS often underperforms because the distribution gap between the modalities is substantial. In this work, we present Time-step Mixup Knowledge Transfer (TMKT), a cross-modal training framework with a probabilistic Time-step Mixup (TSM) strategy. TSM exploits the asynchronous nature of SNNs by interpolating RGB and DVS inputs at different time steps to produce a smooth curriculum within each sequence, which reduces gradient variance and stabilizes optimization, as supported by theoretical analysis. To exploit auxiliary supervision from TSM, TMKT introduces two lightweight modality-aware objectives: Modality-Aware Guidance (MAG) for per-frame source supervision and Mixup Ratio Perception (MRP) for sequence-level mixing-ratio estimation, which explicitly align temporal features with the mixing schedule. TMKT enables smoother knowledge transfer, helps mitigate modality mismatch during training, and achieves superior performance on spiking image classification tasks. Extensive experiments across diverse benchmarks and multiple SNN backbones, together with ablations, demonstrate the effectiveness of our method.
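The core idea of TSM, interpolating a static RGB frame with event frames on a per-time-step basis, can be illustrated with a minimal sketch. The function name, the Beta-distributed per-step mixing ratios, and the tensor layout below are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def time_step_mixup(rgb_frame, dvs_frames, rng=None):
    """Hypothetical sketch of probabilistic time-step mixup.

    rgb_frame:  (C, H, W) static image, implicitly replicated across time.
    dvs_frames: (T, C, H, W) event-voxel frames, one per SNN time step.
    Returns the mixed sequence and the per-step mixing ratios.
    """
    rng = rng or np.random.default_rng()
    num_steps = dvs_frames.shape[0]
    # One mixing ratio per time step (assumed Beta(2, 2) schedule),
    # so each step sees a different RGB/DVS blend.
    lambdas = rng.beta(2.0, 2.0, size=num_steps)
    mixed = np.stack([
        lam * rgb_frame + (1.0 - lam) * dvs
        for lam, dvs in zip(lambdas, dvs_frames)
    ])
    return mixed, lambdas
```

Because each time step draws its own ratio, the sequence spans a range of RGB-to-DVS blends, which is the "smooth curriculum within each sequence" the abstract describes; the returned `lambdas` would serve as targets for an MRP-style ratio-estimation objective.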