Spike-EVPR: Deep Spiking Residual Network with Cross-Representation Aggregation for Event-Based Visual Place Recognition

📅 2024-02-16

🏛️ arXiv.org

📈 Citations: 3

✨ Influential: 0


🤖 AI Summary

Existing event-camera-based visual place recognition (VPR) methods typically convert sparse, asynchronous events into dense frame-like representations. This sacrifices sparsity and adds computational overhead. Spiking neural networks (SNNs) are energy efficient, but they lack event representations tailored to them and have limited capacity to learn discriminative global features. To address these limitations, the authors propose Spike-EVPR, a deep SNN architecture that combines spatiotemporal dynamic modeling with cross-representation collaborative learning. The design includes dual event representations that decouple polarity and timestamp information, a Bifurcated Spike Residual Encoder (BSR-Encoder), a Shared&Specific Descriptor Extractor (SSD-Extractor), and a Cross-Descriptor Aggregation Module (CDA-Module). On Brisbane-Event-VPR and DDD20, the method improves average Recall@1 by 7.61% and 13.20%, respectively, over several existing event-based VPR pipelines. The result is an end-to-end, spike-driven framework for robust global scene representation.
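The following minimal Python sketch makes the time-step trade-off concrete. It builds two generic event representations: per-bin ON/OFF event counts and a most-recent-timestamp surface. It assumes events arrive as `(x, y, t, p)` tuples. The function names `polarity_count_bins` and `timestamp_surface` are hypothetical, and neither is claimed to be the representation used in Spike-EVPR. The number of bins stands in for the SNN's number of time steps.

```python
from typing import List, Tuple

# Hypothetical event format: (x, y, t, p) = column, row, timestamp in seconds, polarity (+1 ON / -1 OFF).
Event = Tuple[int, int, float, int]


def polarity_count_bins(events: List[Event], width: int, height: int,
                        num_bins: int, t_start: float, t_end: float):
    """Count ON and OFF events separately in each of `num_bins` time bins.

    Returns a nested list of shape [num_bins][2][height][width]; `num_bins`
    plays the role of the SNN's number of time steps T (an assumption).
    """
    bins = [[[[0] * width for _ in range(height)] for _ in range(2)]
            for _ in range(num_bins)]
    duration = t_end - t_start
    for x, y, t, p in events:
        if not (t_start <= t < t_end):
            continue  # skip events outside the time window
        b = min(int((t - t_start) / duration * num_bins), num_bins - 1)
        channel = 0 if p > 0 else 1  # channel 0 = ON, channel 1 = OFF
        bins[b][channel][y][x] += 1
    return bins


def timestamp_surface(events: List[Event], width: int, height: int,
                      t_start: float, t_end: float):
    """Store the most recent event time at each pixel, normalised to [0, 1).

    Pixels without events (or whose only event is exactly at t_start) stay 0.0.
    """
    surface = [[0.0] * width for _ in range(height)]
    duration = t_end - t_start
    for x, y, t, _ in events:
        if t_start <= t < t_end:
            surface[y][x] = max(surface[y][x], (t - t_start) / duration)
    return surface


events = [(0, 0, 0.00, 1), (1, 0, 0.30, -1), (1, 1, 0.60, 1), (1, 1, 0.90, 1)]
counts = polarity_count_bins(events, width=2, height=2, num_bins=2, t_start=0.0, t_end=1.0)
surface = timestamp_surface(events, width=2, height=2, t_start=0.0, t_end=1.0)
print(counts[1][0])  # ON counts in the second bin: [[0, 0], [0, 2]]
print(surface)       # [[0.0, 0.3], [0.0, 0.9]]
```

Keeping `num_bins` small shortens the input sequence and therefore the SNN's training cost. The timestamp surface retains some of the fine timing information that coarse bins discard, which is one plausible motivation for pairing two complementary representations.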


📝 Abstract

Event cameras have been successfully applied to visual place recognition (VPR) tasks using deep artificial neural networks (ANNs) in recent years. However, previously proposed deep ANN architectures often fail to harness the abundant temporal information present in event streams. In contrast, deep spiking networks exhibit more intricate spatiotemporal dynamics and are inherently well suited to processing sparse, asynchronous event streams. Unfortunately, feeding temporally dense event volumes directly into a spiking network introduces an excessive number of time steps, making training prohibitively expensive for large-scale VPR tasks. To address these issues, we propose a novel deep spiking network architecture called Spike-EVPR for event-based VPR. First, we introduce two novel event representations tailored for SNNs that fully exploit the spatiotemporal information in event streams while minimizing GPU memory usage during training. Then, to realize the full potential of these two representations, we construct a Bifurcated Spike Residual Encoder (BSR-Encoder) with strong representational capacity to extract high-level features from both representations. Next, we introduce a Shared&Specific Descriptor Extractor (SSD-Extractor), which extracts the features shared between the two representations as well as the features specific to each. Finally, we propose a Cross-Descriptor Aggregation Module (CDA-Module) that fuses these three features into a refined, robust global descriptor of the scene. Our experimental results show that Spike-EVPR outperforms several existing EVPR pipelines on the Brisbane-Event-VPR and DDD20 datasets, increasing average Recall@1 by 7.61% on Brisbane and 13.20% on DDD20.
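The final fusion step and the Recall@1 metric can be illustrated with a small, self-contained sketch. `fuse_descriptors` L2-normalises one shared and two specific descriptors, concatenates them, and normalises the result. It is a hypothetical placeholder, not the CDA-Module. `recall_at_1` counts a query as correct when its nearest database descriptor lies within a distance tolerance of the query's true position. This follows a common VPR convention, but the 2D positions, tolerance, and toy descriptors here are assumptions.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def fuse_descriptors(shared, specific_a, specific_b):
    """Hypothetical fusion: normalise each part, concatenate, normalise again."""
    parts = [l2_normalize(shared), l2_normalize(specific_a), l2_normalize(specific_b)]
    return l2_normalize([x for part in parts for x in part])


def recall_at_1(query_desc, db_desc, query_pos, db_pos, tolerance):
    """Fraction of queries whose nearest database descriptor (L2 distance)
    was captured within `tolerance` of the query's true position."""
    hits = 0
    for q, qp in zip(query_desc, query_pos):
        best = min(range(len(db_desc)), key=lambda i: math.dist(q, db_desc[i]))
        if math.dist(qp, db_pos[best]) <= tolerance:
            hits += 1
    return hits / len(query_desc)


# Toy database of three places and two queries (all values illustrative).
db = [fuse_descriptors([1, 0], [1, 0], [0, 1]),
      fuse_descriptors([0, 1], [0, 1], [1, 0]),
      fuse_descriptors([1, 1], [1, 0], [1, 0])]
db_pos = [(0.0, 0.0), (100.0, 0.0), (200.0, 0.0)]
queries = [fuse_descriptors([0.9, 0.1], [1, 0], [0, 1]),  # resembles place 0
           fuse_descriptors([0.1, 0.9], [0, 1], [1, 0])]  # resembles place 1
query_pos = [(5.0, 0.0), (190.0, 0.0)]  # second query was actually taken near place 2
r1 = recall_at_1(queries, db, query_pos, db_pos, tolerance=25.0)
print(r1)  # 0.5
```

The second query retrieves place 1 even though it was captured near place 2, so Recall@1 is 0.5. Each benchmark defines its own tolerance and evaluation protocol, and the values above are illustrative only.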

Problem

Research questions and friction points this paper is trying to address.

Develops SNN-tailored event representations to preserve spatio-temporal cues

Proposes deep spiking residual architecture for robust place descriptor generation

Enables energy-efficient event-based visual place recognition with direct SNN training
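Direct SNN training is usually built on leaky integrate-and-fire (LIF) neurons simulated over discrete time steps. During backpropagation, a smooth surrogate replaces the spike function, whose true gradient is zero almost everywhere. The sketch below shows a single LIF neuron with a hard reset and a sigmoid-derivative surrogate gradient. The time constant, threshold, and surrogate slope `alpha` are illustrative assumptions, and the neuron model Spike-EVPR actually uses is not stated here.

```python
import math


def lif_forward(inputs, tau=2.0, v_threshold=1.0, v_reset=0.0):
    """Simulate one LIF neuron over a sequence of input currents (one per time step).

    Returns the spike train and the membrane potential after any reset.
    """
    v = v_reset
    spikes, trace = [], []
    for x in inputs:
        v = v + (x - (v - v_reset)) / tau  # leaky integration toward the input
        spike = 1 if v >= v_threshold else 0
        if spike:
            v = v_reset  # hard reset after firing
        spikes.append(spike)
        trace.append(v)
    return spikes, trace


def surrogate_grad(v, v_threshold=1.0, alpha=4.0):
    """Derivative of sigmoid(alpha * (v - v_threshold)), used in place of the
    Heaviside step's gradient when backpropagating through spikes."""
    s = 1.0 / (1.0 + math.exp(-alpha * (v - v_threshold)))
    return alpha * s * (1.0 - s)


spikes, trace = lif_forward([1.5, 1.5, 1.5, 0.0, 0.0])
print(spikes)  # [0, 1, 0, 0, 0]
print(trace)   # [0.75, 0.0, 0.75, 0.375, 0.1875]
print(surrogate_grad(1.0))  # 1.0, the peak of the surrogate at the threshold
```

The loop runs once per time step, and backpropagation through time must keep every step's state. As a result, both compute and training memory grow roughly linearly with the number of time steps. This is the cost the abstract attributes to feeding temporally dense event volumes into an SNN.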

Innovation

Methods, ideas, or system contributions that make the work stand out.

SNN-tailored event representations reduce temporal redundancy

Deep spiking residual architecture aggregates features for descriptors

Directly trained end-to-end SNN framework enhances VPR efficiency
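One common way to build deep spiking residual networks is the spike-element-wise (SEW) residual block. In this block, the input spikes are added to the output spikes of a spiking layer. The sketch below implements a SEW-style ADD shortcut around a single linear layer of LIF neurons. Whether the BSR-Encoder uses this exact shortcut is an assumption, and `LIFLayer`, `linear`, and `sew_residual_block` are hypothetical names.

```python
def linear(weights, x):
    """Dense layer without bias: one output current per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


class LIFLayer:
    """Stateful layer of LIF neurons, one membrane potential per unit (reset to 0)."""

    def __init__(self, size, tau=2.0, v_threshold=1.0):
        self.v = [0.0] * size
        self.tau = tau
        self.v_threshold = v_threshold

    def step(self, currents):
        spikes = []
        for i, x in enumerate(currents):
            self.v[i] += (x - self.v[i]) / self.tau  # leaky integration
            if self.v[i] >= self.v_threshold:
                spikes.append(1)
                self.v[i] = 0.0  # hard reset
            else:
                spikes.append(0)
        return spikes


def sew_residual_block(spike_seq, weights, tau=2.0):
    """Apply linear + LIF to each time step of spike_seq ([T][N]) and add the
    input spikes back (SEW-style ADD shortcut). weights must be N x N."""
    lif = LIFLayer(len(weights), tau=tau)
    out = []
    for s_in in spike_seq:
        s_out = lif.step(linear(weights, s_in))
        out.append([a + b for a, b in zip(s_out, s_in)])
    return out


out = sew_residual_block([[1, 1], [1, 1], [0, 0]], weights=[[2.0, 0.0], [0.0, 0.5]])
print(out)  # [[2, 1], [2, 1], [0, 0]]
```

Unit 1 never fires, yet its input spikes still pass through the shortcut. Identity-style residual connections work this way, letting signals and gradients reach deep layers. With the ADD shortcut, outputs can exceed 1 (here 2), so they are no longer strictly binary spikes.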

🔎 Similar Papers

Applications of Spiking Neural Networks in Visual Place Recognition