🤖 AI Summary
Existing multimodal UAV datasets primarily target localization or 3D reconstruction, and they lack both frame-synchronized, fine-grained semantic annotations for camera images and LiDAR point clouds and high-accuracy 6-DoF pose ground truth, which hinders advanced scene understanding. To address this, we introduce the first large-scale, synchronized multimodal UAV dataset explicitly designed for joint 2D/3D understanding. It provides, for the first time, per-frame semantic segmentation labels for both RGB images and LiDAR point clouds, coupled with centimeter-level 6-DoF poses. Built on the rigorously calibrated MARS-LVIG platform and enriched with meticulous manual annotation, the dataset covers diverse, complex environments, including urban streets and industrial campuses. It supports benchmarking across multiple tasks, including semantic segmentation, depth estimation, 6-DoF localization, place recognition, and novel view synthesis, and is publicly released to accelerate research and evaluation of high-level UAV perception algorithms.
📝 Abstract
Multi-modal perception is essential for unmanned aerial vehicle (UAV) operations, as it enables a comprehensive understanding of the UAV's surrounding environment. However, most existing multi-modal UAV datasets are primarily oriented toward localization and 3D reconstruction tasks, or support only map-level semantic segmentation due to the lack of frame-wise annotations for both camera images and LiDAR point clouds. This limitation prevents their use for high-level scene understanding tasks. To address this gap and advance multi-modal UAV perception, we introduce UAVScenes, a large-scale dataset designed to benchmark various tasks across both 2D and 3D modalities. Our benchmark dataset is built upon the well-calibrated multi-modal UAV dataset MARS-LVIG, originally developed only for simultaneous localization and mapping (SLAM). We enhance this dataset by providing manually labeled semantic annotations for both frame-wise images and LiDAR point clouds, along with accurate 6-degree-of-freedom (6-DoF) poses. These additions enable a wide range of UAV perception tasks, including segmentation, depth estimation, 6-DoF localization, place recognition, and novel view synthesis (NVS). Our dataset is available at https://github.com/sijieaaa/UAVScenes.
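Concretely, each synchronized frame pairs an RGB image and a LiDAR sweep, each with its own semantic labels, under a single 6-DoF pose. The sketch below shows one way such a sample might be represented and how the pose fuses per-frame sweeps into a common world frame; the field names, shapes, and the `points_to_world` helper are illustrative assumptions, not the dataset's actual API, so consult the UAVScenes repository for the real data layout.

```python
# Minimal sketch of a synchronized UAVScenes-style multimodal frame.
# All field names and shapes here are illustrative assumptions; the
# actual loader and directory structure are defined in the repository.
from dataclasses import dataclass

import numpy as np


@dataclass
class UAVFrame:
    image: np.ndarray         # RGB image, (H, W, 3), uint8
    image_labels: np.ndarray  # per-pixel semantic class IDs, (H, W), int
    points: np.ndarray        # LiDAR point cloud, (N, 3), float32, sensor frame
    point_labels: np.ndarray  # per-point semantic class IDs, (N,), int
    pose: np.ndarray          # 6-DoF pose as a 4x4 SE(3) matrix, sensor-to-world


def points_to_world(frame: UAVFrame) -> np.ndarray:
    """Transform the LiDAR points into the world frame using the 6-DoF pose."""
    homogeneous = np.hstack([frame.points, np.ones((len(frame.points), 1))])
    return (frame.pose @ homogeneous.T).T[:, :3]
```

With accurate per-frame poses, such a transform is what allows individual labeled sweeps to be aggregated into a semantic map, or LiDAR points to be projected into the image for cross-modal tasks such as depth estimation.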