On-Device Self-Supervised Learning of Low-Latency Monocular Depth from Only Events

📅 2024-12-09
🏛️ arXiv.org
📈 Citations: 0 · Influential citations: 0
🤖 AI Summary
Resource-constrained agile robots such as micro-drones demand ultra-low-latency, energy-efficient monocular depth estimation without reliance on ground-truth supervision or high-power RGB cameras. Method: We propose the first on-device, online self-supervised monocular depth estimation framework for event cameras, leveraging only low-power, low-latency event streams. Our approach integrates a contrast maximization self-supervision objective, a lightweight CNN architecture, voxel-grid event encoding, online gradient clipping, and memory-aware inference scheduling. Contribution/Results: The method achieves results competitive with the state of the art on standard event-based depth benchmarks. In real-world flight, it enables sub-12 ms end-to-end latency for obstacle avoidance. With only minutes of onboard fine-tuning, it adapts across domains, reducing depth error by 32% (on a Jetson Orin). Crucially, it is the first to realize full-pipeline on-device training for event-driven depth estimation, significantly improving both latency and memory efficiency.
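
The voxel-grid encoding mentioned above is a common way to present asynchronous events to a CNN. Below is a minimal NumPy sketch of one standard variant, in which each event's polarity is split linearly between the two nearest temporal bins; the function name, bin count, and normalization details are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def events_to_voxel_grid(xs, ys, ts, ps, num_bins, height, width):
    """Accumulate an event stream into a (num_bins, height, width) grid.

    xs, ys: integer pixel coordinates; ts: timestamps (float, sorted);
    ps: polarities in {-1, +1}. Each event's polarity is split linearly
    between its two nearest time bins (bilinear interpolation in time).
    """
    voxel = np.zeros((num_bins, height, width), dtype=np.float32)
    if ts.size == 0:
        return voxel
    # Normalize timestamps to the continuous bin axis [0, num_bins - 1].
    t_norm = (ts - ts[0]) / max(ts[-1] - ts[0], 1e-9) * (num_bins - 1)
    t0 = np.floor(t_norm).astype(int)
    frac = t_norm - t0
    # Split each event's contribution between bins t0 and t0 + 1.
    np.add.at(voxel, (t0, ys, xs), ps * (1.0 - frac))
    t1 = np.clip(t0 + 1, 0, num_bins - 1)
    np.add.at(voxel, (t1, ys, xs), ps * frac)
    return voxel
```

A grid like this can be built per inference window and fed to the lightweight CNN as a num_bins-channel image.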

📝 Abstract
Event cameras provide low-latency perception for only milliwatts of power. This makes them highly suitable for resource-restricted, agile robots such as small flying drones. Self-supervised learning based on contrast maximization holds great potential for event-based robot vision, as it forgoes the need for high-frequency ground truth and allows for online learning in the robot's operational environment. However, online, onboard learning raises the major challenge of achieving sufficient computational efficiency for real-time learning, while maintaining competitive visual perception performance. In this work, we improve the time and memory efficiency of the contrast maximization learning pipeline. Benchmarking experiments show that the proposed pipeline achieves competitive results with the state of the art on the task of depth estimation from events. Furthermore, we demonstrate the usability of the learned depth for obstacle avoidance through real-world flight experiments. Finally, we compare the performance of different combinations of pre-training and fine-tuning of the depth estimation networks, showing that onboard domain adaptation is feasible given a few minutes of flight.
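
To make the contrast maximization idea concrete, the sketch below (a hedged illustration, not the paper's implementation) warps events to a reference time using a per-event flow, which in depth estimation would be derived from the predicted depth and ego-motion, splats them into an image of warped events (IWE), and scores its sharpness via variance. The function name, signature, and bilinear splatting details are assumptions.

```python
import torch

def contrast_maximization_loss(xs, ys, ts, flow, t_ref, height, width):
    """Negative variance of the image of warped events (IWE).

    xs, ys, ts: float event coordinates and timestamps; flow: (N, 2)
    per-event pixel velocity. A sharper (higher-contrast) IWE indicates
    better motion/depth estimates, so we return -variance to minimize.
    """
    dt = t_ref - ts                      # time from each event to t_ref
    xw = xs + flow[:, 0] * dt            # warped x coordinate
    yw = ys + flow[:, 1] * dt            # warped y coordinate
    x0, y0 = xw.floor(), yw.floor()
    iwe = torch.zeros(height * width, device=xs.device)
    # Bilinearly splat every event into the IWE; the weights depend
    # smoothly on the warp, so gradients flow back into `flow`.
    for dx in (0, 1):
        for dy in (0, 1):
            xi, yi = x0 + dx, y0 + dy
            w = (1 - (xw - xi).abs()) * (1 - (yw - yi).abs())
            inside = (xi >= 0) & (xi < width) & (yi >= 0) & (yi < height)
            idx = (yi * width + xi).long()
            iwe.index_add_(0, idx[inside], w[inside])
    # Maximizing IWE contrast == minimizing its negative variance.
    return -iwe.var()
```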
Problem

Research questions and friction points this paper is trying to address.

Enable low-latency monocular depth learning from event data
Achieve real-time on-device learning for resource-restricted robots
Improve computational efficiency without sacrificing depth estimation accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

On-device self-supervised learning for efficiency (see the training-step sketch after this list)
Contrast maximization for event-based depth estimation
Low-latency monocular depth from event cameras
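
As a rough picture of what one onboard adaptation step could look like, the following PyTorch sketch combines the pieces above: a forward pass on an event voxel grid, a self-supervised loss, and gradient clipping to keep online updates stable on embedded hardware. The helper names and the clipping threshold are illustrative assumptions, not the paper's exact training loop.

```python
import torch

def online_finetune_step(model, optimizer, voxel, loss_fn, max_grad_norm=1.0):
    """One onboard adaptation step on a single event voxel grid.

    model: depth network; voxel: (num_bins, H, W) event tensor;
    loss_fn: self-supervised objective (e.g. contrast maximization).
    """
    optimizer.zero_grad(set_to_none=True)   # free gradient memory between steps
    depth = model(voxel.unsqueeze(0))       # (1, 1, H, W) depth prediction
    loss = loss_fn(depth)
    loss.backward()
    # Clip gradients so noisy in-flight batches cannot destabilize training.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    return loss.item()
```

Running a step like this for a few minutes of flight is the kind of onboard domain adaptation the abstract describes.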
Authors

J. Hagenaars, MAVLab, TU Delft
Yilun Wu, MAVLab, TU Delft
Federico Paredes-Vallés, PhD; Senior Research Engineer at Sony (Artificial Intelligence, Neuromorphic Computing, Computer Vision, Aerial Robotics)
S. Stroobants, MAVLab, TU Delft
Guido C. H. E. de Croon, MAVLab, TU Delft