🤖 AI Summary
This work addresses the scarcity of multimodal perception data and the lack of a unified benchmark for sensor fusion under complex weather and lighting conditions. To this end, the authors introduce a synchronously captured multimodal driving dataset that, for the first time, systematically integrates stereo event, RGB, and thermal cameras with 4D radar and dual LiDAR, accompanied by precise annotations including object trajectory IDs and ego-vehicle odometry. Building on this dataset, they propose a unified 2D/3D object detection benchmark and a cross-modal feature-space fusion framework that enables fair comparison across diverse sensor configurations. Experiments show that the proposed approach substantially improves the robustness and accuracy of 3D object detection under a wide range of challenging environmental conditions.
📝 Abstract
In this paper, we present DSERT-RoLL, a driving dataset that combines stereo event, RGB, and thermal cameras with 4D radar and dual LiDAR, collected across diverse weather and illumination conditions. The dataset provides precise 2D and 3D bounding boxes with track IDs and ego-vehicle odometry, enabling fair comparisons within and across sensor combinations. It is designed to alleviate data scarcity for novel sensors such as event cameras and 4D radar and to support systematic studies of their behavior. We establish unified 3D and 2D benchmarks that enable direct comparison of characteristics and strengths across and within sensor families. We report baselines for representative single-modality and multimodal methods and provide protocols that encourage research on different fusion strategies and sensor combinations. In addition, we propose a fusion framework that integrates sensor-specific cues into a unified feature space and improves 3D detection robustness under varied weather and lighting.
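To make the feature-space fusion idea concrete, below is a minimal PyTorch sketch of one way sensor-specific features could be projected into a shared space and combined. The module names, feature dimensions, and mean-pooling fusion are illustrative assumptions; the abstract does not specify the paper's actual architecture.

```python
# Minimal sketch of feature-space fusion across heterogeneous sensors.
# All names and dimensions here are illustrative assumptions, not the
# architecture described in the paper.
import torch
import torch.nn as nn

class UnifiedFusion(nn.Module):
    """Project per-modality features into one shared space, then fuse."""
    def __init__(self, modality_dims: dict, fused_dim: int = 256):
        super().__init__()
        # One linear projection per sensor, mapping into the shared space.
        self.proj = nn.ModuleDict({
            name: nn.Linear(dim, fused_dim)
            for name, dim in modality_dims.items()
        })
        self.norm = nn.LayerNorm(fused_dim)

    def forward(self, feats: dict) -> torch.Tensor:
        # Only the modalities present in `feats` are fused, so a dropped
        # sensor is simply skipped -- one way such a framework could stay
        # robust under degraded conditions.
        projected = [self.proj[name](x) for name, x in feats.items()]
        return self.norm(torch.stack(projected, dim=0).mean(dim=0))

# Usage with hypothetical per-sensor feature sizes (batch of 2 samples):
dims = {"rgb": 512, "event": 128, "thermal": 256, "radar4d": 64, "lidar": 384}
fusion = UnifiedFusion(dims)
feats = {name: torch.randn(2, dim) for name, dim in dims.items()}
fused = fusion(feats)  # shape (2, 256), ready for a downstream detection head
```

In a real detector the inputs would be spatially aligned feature maps (e.g., in a common bird's-eye-view grid) rather than flat vectors, and the fusion step might use attention instead of averaging; the sketch only illustrates the shared-space principle.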