AI Summary
This work addresses robust indoor navigation for mobile robots under low-light and fast-motion conditions, where conventional RGB cameras often fail and effective end-to-end control methods leveraging event cameras remain scarce. The authors introduce a novel real-world indoor person-following dataset that synchronously captures event streams, RGB images, and expert control commands. They propose a late-fusion RGB-Event navigation policy trained via behavioral cloning, employing dual MobileNet encoders and a Transformer-based fusion module for multimodal imitation learning. This study presents the first demonstration of event camera-based end-to-end navigation in real low-light indoor environments. The proposed method significantly outperforms RGB-only baselines in unseen scenes, achieving lower action prediction error and confirming the critical role of event data in improving policy robustness and environmental adaptability.
Abstract
Event cameras provide high dynamic range and microsecond-level temporal resolution, making them well-suited for indoor robot navigation, where conventional RGB cameras degrade under fast motion or low-light conditions. Despite advances in event-based perception spanning detection, SLAM, and pose estimation, there remains limited research on end-to-end control policies that exploit the asynchronous nature of event streams. To address this gap, we introduce a real-world indoor person-following dataset collected using a TurtleBot 2 robot, featuring synchronized raw event streams, RGB frames, and expert control actions across multiple indoor maps and trajectories under both normal and low-light conditions. We further build a multimodal data preprocessing pipeline that temporally aligns event and RGB observations while reconstructing ground-truth actions from odometry to support high-quality imitation learning. Building on this dataset, we propose a late-fusion RGB-Event navigation policy that combines dual MobileNet encoders with a Transformer-based fusion module trained via behavioral cloning. A systematic evaluation of RGB-only, Event-only, and RGB-Event fusion models across 12 training variations, ranging from single-path imitation to general multi-path imitation, shows that policies incorporating event data, particularly the fusion model, achieve improved robustness and lower action prediction error, especially in unseen low-light conditions where RGB-only models fail. We release the dataset, synchronization pipeline, and trained models at https://eventbasedvision.github.io/eNavi/
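The abstract mentions two preprocessing steps: temporally aligning asynchronous event streams with fixed-rate RGB frames, and reconstructing ground-truth actions from odometry. The following is a minimal illustrative sketch of how such steps might look; the function names, window length, and data layout are assumptions for illustration, not the released pipeline.

```python
import numpy as np

def align_events_to_frames(event_ts, frame_ts, window_s=0.033):
    """For each RGB frame timestamp, return the index range [lo, hi) of
    events whose timestamps fall in (t_frame - window_s, t_frame]."""
    lo = np.searchsorted(event_ts, frame_ts - window_s, side="right")
    hi = np.searchsorted(event_ts, frame_ts, side="right")
    return np.stack([lo, hi], axis=1)

def actions_from_odom(x, y, yaw, t):
    """Reconstruct (linear, angular) velocity commands by finite
    differences over consecutive odometry poses."""
    dt = np.diff(t)
    v = np.hypot(np.diff(x), np.diff(y)) / dt
    w = np.diff(np.unwrap(yaw)) / dt
    return v, w

# Toy data: events at 1 kHz, frames at ~30 Hz.
event_ts = np.arange(0.0, 1.0, 0.001)
frame_ts = np.arange(0.033, 1.0, 0.033)
ranges = align_events_to_frames(event_ts, frame_ts)

# Sanity check: straight-line motion at constant speed.
t = np.arange(0.0, 1.0, 0.1)
v, w = actions_from_odom(t, np.zeros_like(t), np.zeros_like(t), t)
```

Nearest-window indexing with `searchsorted` keeps the raw event stream intact, so downstream code can choose its own event representation (voxel grid, frame, raw tuples) per aligned slice.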
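To make the late-fusion idea concrete, here is a deliberately tiny numpy sketch: two per-modality encoders each produce one token, a single self-attention layer fuses the tokens, and a linear head predicts a 2-D action (linear and angular velocity). The real policy uses dual MobileNet encoders and a Transformer fusion module; every weight and dimension below is an illustrative stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    """One scaled dot-product self-attention layer over modality tokens."""
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return scores @ v

d = 16
# Stand-ins for the two encoders: each projects its modality's feature
# vector to one d-dimensional token (MobileNets in the actual model).
W_rgb = rng.normal(size=(64, d))
W_evt = rng.normal(size=(64, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
W_head = rng.normal(size=(d, 2))  # -> (linear vel, angular vel)

def policy(rgb_feat, evt_feat):
    tokens = np.stack([rgb_feat @ W_rgb, evt_feat @ W_evt])  # (2, d)
    fused = self_attention(tokens, Wq, Wk, Wv)               # (2, d)
    return fused.mean(axis=0) @ W_head                       # (2,)

action = policy(rng.normal(size=64), rng.normal(size=64))
```

Fusing after separate encoders (late fusion) lets each branch specialize in its modality's statistics, and the attention step lets the policy reweight the two tokens, e.g. leaning on the event token when the RGB view is underexposed.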