🤖 AI Summary
This work addresses the limitations of conventional frame-based cameras in autonomous driving, which suffer from motion blur under long exposure, high-speed dynamics, and varying illumination, thereby compromising the robustness of steering prediction. To overcome these challenges, the authors propose an energy-efficient imitation learning framework that synergistically integrates event cameras with frame cameras. The approach introduces two key innovations: an energy-driven cross-modal fusion module and an energy-aware decoder, marking the first effort to explicitly incorporate energy efficiency into multimodal driving control. Evaluated on the real-world DDD20 and DRFuser driving datasets, the proposed method significantly outperforms current state-of-the-art approaches, achieving a favorable balance between prediction accuracy and computational energy efficiency.
📝 Abstract
In autonomous driving, relying solely on frame-based cameras can lead to inaccuracies caused by factors like long exposure times, high-speed motion, and challenging lighting conditions. To address these issues, we adopt a bio-inspired vision sensor known as the event camera. Unlike conventional cameras, event cameras capture sparse, asynchronous events that provide a complementary modality to mitigate these challenges. In this work, we propose an energy-aware imitation learning framework for steering prediction that leverages both events and frames. Specifically, we design an Energy-driven Cross-modality Fusion Module (ECFM) and an energy-aware decoder to produce reliable and safe predictions. Extensive experiments on two public real-world datasets, DDD20 and DRFuser, demonstrate that our method outperforms existing state-of-the-art (SOTA) approaches. The code and trained models will be released upon acceptance.
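To make the "sparse, asynchronous events" modality concrete, the sketch below accumulates a stream of `(x, y, t, polarity)` events into a dense two-channel histogram that a standard CNN could consume alongside frames. This is a common, illustrative encoding only; the paper does not specify the exact event representation fed to its fusion module.

```python
import numpy as np

def events_to_frame(events, height, width):
    """Accumulate sparse (x, y, t, polarity) events into a dense
    2-channel count image (one channel per polarity).

    `events` is an (N, 4) array. This polarity-histogram encoding is
    an illustrative assumption, not the paper's stated representation.
    """
    frame = np.zeros((2, height, width), dtype=np.float32)
    for x, y, _t, p in events:
        channel = 0 if p > 0 else 1  # positive vs. negative brightness change
        frame[channel, int(y), int(x)] += 1.0
    return frame

# Tiny synthetic stream: three events on a 4x4 sensor.
evts = np.array([
    [1, 2, 0.01, +1],
    [1, 2, 0.02, +1],
    [3, 0, 0.03, -1],
])
dense = events_to_frame(evts, height=4, width=4)
print(dense[0, 2, 1], dense[1, 0, 3])  # 2.0 1.0
```

Because events fire only where brightness changes, the resulting tensor stays mostly zero, which is what makes an event stream a natural complement to blurred or poorly exposed frames.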