AI Summary
To address the low energy efficiency and limited performance of ANN-SNN hybrid approaches to event-camera-based object tracking, this paper proposes SDTrack, the first fully spike-driven Transformer tracking pipeline, built on a Spiking MetaFormer backbone. Its core innovation is the Global Trajectory Prompt (GTP) mechanism, which fuses event streams with global trajectory priors directly in the spike domain, enabling end-to-end tracking without data augmentation or post-processing. The method employs event-image encoding and spike-domain coordinate regression to establish a lightweight, low-power, high-performance spiking neural network (SNN) tracking baseline. Evaluated on multiple event-based tracking benchmarks, SDTrack achieves state-of-the-art (SOTA) accuracy while attaining the lowest parameter count and energy consumption among existing methods. This work establishes a new paradigm and a new foundational baseline for neuromorphic vision-based tracking.
Abstract
Event cameras provide superior temporal resolution, dynamic range, power efficiency, and pixel bandwidth. Spiking Neural Networks (SNNs) naturally complement event data through discrete spike signals, making them ideal for event-based tracking. However, current approaches combine Artificial Neural Networks (ANNs) with SNNs and rely on suboptimal architectures, which compromises energy efficiency and limits tracking performance. To address these limitations, we propose the first Transformer-based spike-driven tracking pipeline. Our Global Trajectory Prompt (GTP) method effectively captures global trajectory information and aggregates it with event streams into event images to enhance spatiotemporal representation. We then introduce SDTrack, a Transformer-based spike-driven tracker comprising a Spiking MetaFormer backbone and a simple tracking head that directly predicts normalized coordinates from spike signals. The framework is end-to-end and requires neither data augmentation nor post-processing. Extensive experiments demonstrate that SDTrack achieves state-of-the-art performance while maintaining the lowest parameter count and energy consumption across multiple event-based tracking benchmarks, establishing a solid baseline for future research in neuromorphic vision.
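To make the pipeline concrete, the sketch below illustrates the ideas named in the abstract: accumulating an event stream into an event image, overlaying past target positions as a trajectory prompt, encoding the result as binary spike frames, and regressing normalized coordinates from spike counts. This is a minimal NumPy illustration, not the paper's implementation; all function names, the integrate-and-fire encoder, and the centroid-based regression head are simplified assumptions standing in for the actual GTP module and Spiking MetaFormer backbone.

```python
import numpy as np

def events_to_image(events, H, W):
    """Accumulate (x, y, polarity) events into a 2-channel event image."""
    img = np.zeros((2, H, W), dtype=np.float32)
    for x, y, p in events:
        img[p, y, x] += 1.0
    return img

def global_trajectory_prompt(img, trajectory, radius=1):
    """Toy stand-in for GTP: overlay past target centers (a motion prior)
    onto the event image so the tracker sees global trajectory context."""
    prompted = img.copy()
    _, H, W = img.shape
    for cx, cy in trajectory:
        y0, y1 = max(0, cy - radius), min(H, cy + radius + 1)
        x0, x1 = max(0, cx - radius), min(W, cx + radius + 1)
        prompted[:, y0:y1, x0:x1] += 1.0
    return prompted

def spike_encode(img, T=4, threshold=0.5):
    """Encode the image into T binary spike frames with a simple
    integrate-and-fire neuron (soft reset)."""
    mem = np.zeros_like(img)
    frames = []
    for _ in range(T):
        mem += img
        s = (mem >= threshold).astype(np.float32)
        mem -= s * threshold  # soft reset after firing
        frames.append(s)
    return np.stack(frames)  # shape (T, 2, H, W)

def regress_box_center(spikes):
    """Toy spike-domain regression: the firing-rate centroid gives a
    normalized (cx, cy) in [0, 1], mimicking direct coordinate prediction."""
    rate = spikes.mean(axis=(0, 1))          # (H, W) firing rate
    H, W = rate.shape
    total = rate.sum() + 1e-8
    ys, xs = np.mgrid[0:H, 0:W]
    cy = (rate * ys).sum() / total / (H - 1)
    cx = (rate * xs).sum() / total / (W - 1)
    return cx, cy

# Usage: a burst of events near pixel (12, 4) on a 16x16 sensor.
events = [(12, 4, 0)] * 3
img = global_trajectory_prompt(events_to_image(events, 16, 16), trajectory=[])
cx, cy = regress_box_center(spike_encode(img))
```

Note that every stage operates on additions and threshold comparisons only, which is what makes fully spike-driven designs attractive for low-power hardware compared with the multiply-heavy activations of ANN-SNN hybrids.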