A Novel Spike Transformer Network for Depth Estimation from Event Cameras via Cross-modality Knowledge Distillation

📅 2024-04-26
🏛️ arXiv.org
🤖 AI Summary
To address the challenges of sparse, asynchronous spike data, scarce ground-truth annotations, and incompatibility with conventional deep learning pipelines in event-camera-based depth estimation, this paper proposes the first purely spike-driven Transformer network. Methodologically, the authors design an end-to-end spike-sequence encoder that operates directly on raw events, bypassing image reconstruction, and introduce a multi-stage spike feature fusion decoder for depth prediction. To mitigate annotation scarcity, they incorporate a cross-modal knowledge distillation framework leveraging DINOv2 features. Key contributions: (1) the first full adaptation of Transformers to raw event streams without frame-based representation or intensity reconstruction; and (2) state-of-the-art performance on major event-based depth benchmarks (ESIM, MVSEC), achieving superior accuracy while significantly reducing inference latency and energy consumption. This work establishes a new paradigm for efficient, neuromorphic depth perception.

📝 Abstract
Depth estimation is a critical task in computer vision, with applications in autonomous navigation, robotics, and augmented reality. Event cameras, which encode temporal changes in light intensity as asynchronous binary spikes, offer unique advantages such as low latency, high dynamic range, and energy efficiency. However, their unconventional spiking output and the scarcity of labelled datasets pose significant challenges to traditional image-based depth estimation methods. To address these challenges, we propose a novel energy-efficient Spike-Driven Transformer Network (SDT) for depth estimation, leveraging the unique properties of spiking data. The proposed SDT introduces three key innovations: (1) a purely spike-driven transformer architecture that incorporates spike-based attention and residual mechanisms, enabling precise depth estimation with minimal energy consumption; (2) a fusion depth estimation head that combines multi-stage features for fine-grained depth prediction while ensuring computational efficiency; and (3) a cross-modality knowledge distillation framework that utilises a pre-trained vision foundation model (DINOv2) to enhance the training of the spiking network despite limited data availability. This work represents the first exploration of transformer-based spiking neural networks for depth estimation, providing a significant step forward in energy-efficient neuromorphic computing for real-world vision applications.
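The abstract's first innovation, spike-based attention, can be sketched in miniature. The code below is a hypothetical simplification, not the paper's exact design: the Heaviside spiking neuron, the weight names, and the softmax-free linear-attention ordering are all assumptions. It illustrates the general property that once Q, K, and V are binarized into spikes, the attention product reduces to sparse additions, which is what makes spike-driven attention energy-efficient on neuromorphic hardware.

```python
import numpy as np

def heaviside(x, thresh=1.0):
    """Toy spiking neuron: emit a spike (1.0) where the input crosses the threshold."""
    return (x >= thresh).astype(np.float32)

def spike_attention(x, Wq, Wk, Wv, thresh=1.0):
    """Illustrative spike-driven self-attention over T tokens of dimension d.

    Q, K, V are binarized into spike trains, and attention is computed as
    Q @ (K^T @ V) without softmax, so the K^T V product is accumulated first
    (a d x d map), avoiding a dense T x T attention matrix.
    """
    q = heaviside(x @ Wq, thresh)   # (T, d) binary spikes
    k = heaviside(x @ Wk, thresh)   # (T, d) binary spikes
    v = heaviside(x @ Wv, thresh)   # (T, d) binary spikes
    kv = k.T @ v                    # (d, d) accumulated key-value map
    return heaviside(q @ kv, thresh)  # output is again a binary spike map
```

Because every intermediate tensor is binary, the matrix products above involve only integer accumulations, which is the energy argument the abstract makes for a "purely spike-driven" design.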
Problem

Research questions and friction points this paper is trying to address.

Depth estimation from event cameras
Overcoming spiking output challenges
Energy-efficient neuromorphic computing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spike-driven transformer architecture
Fusion depth estimation head
Cross-modality knowledge distillation
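The third innovation, cross-modality knowledge distillation, can be sketched as a feature-alignment objective. This is a minimal sketch under assumptions: the projection matrix, the per-token cosine loss, and the function name are illustrative, not the paper's stated formulation. The idea it demonstrates is that spiking-network features are projected into the teacher's (e.g. DINOv2) embedding space and pulled toward the teacher's features, providing a training signal even when ground-truth depth labels are scarce.

```python
import numpy as np

def cosine_distill_loss(student_feat, teacher_feat, W_proj, eps=1e-8):
    """Hypothetical cross-modality distillation loss.

    student_feat: (N, d_s) features from the spiking student network
    teacher_feat: (N, d_t) features from a frozen foundation-model teacher
    W_proj:       (d_s, d_t) learned projection into the teacher space
    Returns the mean of (1 - cosine similarity) over the N tokens.
    """
    s = student_feat @ W_proj                                  # project into teacher space
    s_n = s / (np.linalg.norm(s, axis=1, keepdims=True) + eps) # unit-normalize student
    t_n = teacher_feat / (np.linalg.norm(teacher_feat, axis=1, keepdims=True) + eps)
    cos = np.sum(s_n * t_n, axis=1)                            # per-token cosine similarity
    return float(np.mean(1.0 - cos))                           # 0 when perfectly aligned
```

A cosine objective is a common choice for distilling from self-supervised teachers because it matches feature directions while ignoring scale; the paper may use a different distance, so treat this only as an illustration of the alignment idea.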
Xin Zhang
Department of Computing and Mathematics, Manchester Metropolitan University, Manchester M15 6BH, UK
Liangxiu Han
Professor, Manchester Metropolitan University, UK
Big Data Analytics / Machine Learning / AI; Parallel & Distributed Computing / Cloud; Bioinformatics
Tam Sobeih
Department of Computing and Mathematics, Manchester Metropolitan University, Manchester M15 6BH, UK
Lianghao Han
Department of Computer Science, Brunel University, Uxbridge UB8 3PH, UK
Darren Dancey
Department of Computing and Mathematics, Manchester Metropolitan University, Manchester M15 6BH, UK