🤖 AI Summary
Existing multimodal UAV datasets primarily target localization or 3D reconstruction, and they lack both frame-synchronized, fine-grained semantic annotations for camera images and LiDAR point clouds and high-accuracy 6-DoF pose ground truth, which hinders advanced scene understanding. To address this, we introduce the first large-scale, synchronized multimodal UAV dataset explicitly designed for joint 2D/3D understanding. It provides, for the first time, per-frame semantic segmentation labels for both RGB images and LiDAR point clouds, coupled with centimeter-level 6-DoF poses. Built on the rigorously calibrated MARS-LVIG platform and enriched with meticulous manual annotation, the dataset covers diverse, complex environments, including urban streets and industrial campuses. It supports benchmarking across multiple tasks, including semantic segmentation, depth estimation, 6-DoF localization, place recognition, and novel view synthesis, and is publicly released to accelerate research and evaluation of high-level UAV perception algorithms.
📝 Abstract
Multi-modal perception is essential for unmanned aerial vehicle (UAV) operations, as it enables a comprehensive understanding of the UAV's surrounding environment. However, most existing multi-modal UAV datasets are primarily oriented toward localization and 3D reconstruction tasks, or support only map-level semantic segmentation due to the lack of frame-wise annotations for both camera images and LiDAR point clouds. This limitation prevents their use for high-level scene understanding tasks. To address this gap and advance multi-modal UAV perception, we introduce UAVScenes, a large-scale dataset designed to benchmark various tasks across both 2D and 3D modalities. Our benchmark dataset is built upon the well-calibrated multi-modal UAV dataset MARS-LVIG, originally developed only for simultaneous localization and mapping (SLAM). We enhance this dataset by providing manually labeled semantic annotations for both frame-wise images and LiDAR point clouds, along with accurate 6-degree-of-freedom (6-DoF) poses. These additions enable a wide range of UAV perception tasks, including segmentation, depth estimation, 6-DoF localization, place recognition, and novel view synthesis (NVS). Our dataset is available at https://github.com/sijieaaa/UAVScenes.
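Concretely, each synchronized frame pairs an RGB image and a LiDAR sweep, each with its own semantic labels, under a single 6-DoF pose. The sketch below shows one way such a sample might be represented and how the pose fuses per-frame sweeps into a common world frame; the field names, shapes, and the `points_to_world` helper are illustrative assumptions, not the dataset's actual API, so consult the UAVScenes repository for the real data layout.

```python
# Minimal sketch of a synchronized UAVScenes-style multimodal frame.
# All field names and shapes here are illustrative assumptions; the
# actual loader and directory structure are defined in the repository.
from dataclasses import dataclass

import numpy as np


@dataclass
class UAVFrame:
    image: np.ndarray         # RGB image, (H, W, 3), uint8
    image_labels: np.ndarray  # per-pixel semantic class IDs, (H, W), int
    points: np.ndarray        # LiDAR point cloud, (N, 3), float32, sensor frame
    point_labels: np.ndarray  # per-point semantic class IDs, (N,), int
    pose: np.ndarray          # 6-DoF pose as a 4x4 SE(3) matrix, sensor-to-world


def points_to_world(frame: UAVFrame) -> np.ndarray:
    """Transform the LiDAR points into the world frame using the 6-DoF pose."""
    homogeneous = np.hstack([frame.points, np.ones((len(frame.points), 1))])
    return (frame.pose @ homogeneous.T).T[:, :3]
```

With accurate per-frame poses, such a transform is what allows individual labeled sweeps to be aggregated into a semantic map, or LiDAR points to be projected into the image for cross-modal tasks such as depth estimation.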