UAVScenes: A Multi-Modal Dataset for UAVs

📅 2025-07-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing UAV multimodal datasets primarily target localization or 3D reconstruction, lacking frame-synchronized, fine-grained semantic annotations for both camera images and LiDAR point clouds, as well as high-accuracy 6-DoF pose ground truth—thus hindering advanced scene understanding. To address this, we introduce the first large-scale, synchronized multimodal UAV dataset explicitly designed for joint 2D/3D understanding. It provides, for the first time, per-frame semantic segmentation labels for both RGB images and LiDAR point clouds, coupled with centimeter-level accurate 6-DoF poses. Built upon the MARS-LVIG platform, the dataset undergoes rigorous sensor calibration and meticulous manual annotation, covering diverse, complex environments including urban streets and industrial campuses. It supports benchmarking across multiple tasks—including semantic segmentation, depth estimation, (re)localization, and novel-view synthesis—and is publicly released to accelerate research and evaluation of high-level UAV environmental perception algorithms.

📝 Abstract
Multi-modal perception is essential for unmanned aerial vehicle (UAV) operations, as it enables a comprehensive understanding of the UAVs' surrounding environment. However, most existing multi-modal UAV datasets are primarily biased toward localization and 3D reconstruction tasks, or only support map-level semantic segmentation due to the lack of frame-wise annotations for both camera images and LiDAR point clouds. This limitation prevents them from being used for high-level scene understanding tasks. To address this gap and advance multi-modal UAV perception, we introduce UAVScenes, a large-scale dataset designed to benchmark various tasks across both 2D and 3D modalities. Our benchmark dataset is built upon the well-calibrated multi-modal UAV dataset MARS-LVIG, originally developed only for simultaneous localization and mapping (SLAM). We enhance this dataset by providing manually labeled semantic annotations for both frame-wise images and LiDAR point clouds, along with accurate 6-degree-of-freedom (6-DoF) poses. These additions enable a wide range of UAV perception tasks, including segmentation, depth estimation, 6-DoF localization, place recognition, and novel view synthesis (NVS). Our dataset is available at https://github.com/sijieaaa/UAVScenes
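The paper does not prescribe a toolkit API, but the value of well-calibrated extrinsics and accurate 6-DoF poses for joint 2D/3D labeling can be illustrated with a minimal sketch. The function names and matrices below are hypothetical, not from the UAVScenes repository; the sketch simply projects LiDAR points into a camera image via a LiDAR-to-camera extrinsic and a pinhole intrinsic matrix:

```python
import numpy as np

def pose_to_matrix(rotation, translation):
    """Assemble a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector translation."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T

def project_lidar_to_image(points_lidar, T_cam_lidar, K):
    """Project Nx3 LiDAR points into pixel coordinates.

    T_cam_lidar: 4x4 LiDAR-to-camera extrinsic transform (hypothetical calibration).
    K: 3x3 pinhole camera intrinsic matrix.
    Returns (uv, mask): pixel coordinates for points in front of the camera,
    plus a boolean mask marking which input points those are.
    """
    pts_h = np.hstack([points_lidar, np.ones((len(points_lidar), 1))])
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]
    mask = pts_cam[:, 2] > 0  # keep only points with positive depth
    uv_h = (K @ pts_cam[mask].T).T
    uv = uv_h[:, :2] / uv_h[:, 2:3]  # perspective division
    return uv, mask
```

With centimeter-level pose error, projections like this stay pixel-consistent frame to frame, which is what makes frame-wise cross-modal annotation and tasks such as depth estimation or NVS feasible.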
Problem

Research questions and friction points this paper is trying to address.

Lack of frame-wise annotations in UAV datasets for high-level scene understanding.
Existing UAV datasets are biased towards localization and 3D reconstruction.
Need for a multi-modal dataset supporting diverse UAV perception tasks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-modal UAV dataset with semantic annotations
Frame-wise labeled images and LiDAR point clouds
Supports 2D and 3D perception tasks
Sijie Wang
Nanyang Technological University
Siqi Li
Nanyang Technological University
Yawei Zhang
Nanyang Technological University
Shangshu Yu
Nanyang Technological University
3D Computer Vision · LiDAR Localization · Depth Estimation · Pose Estimation
Shenghai Yuan
Nanyang Technological University
Rui She
BUAA<<NTU<<THU
Intelligent information processing · computer vision · embodied AI · AIoT · information theory
Quanjiang Guo
University of Electronic Science and Technology of China
NLP · Knowledge Graph · LLM
JinXuan Zheng
Nanyang Technological University
Ong Kang Howe
Nanyang Technological University
Leonrich Chandra
Nanyang Technological University
Shrivarshann Srijeyan
Nanyang Technological University
Aditya Sivadas
Nanyang Technological University
Toshan Aggarwal
Nanyang Technological University
Heyuan Liu
Nanyang Technological University
Hongming Zhang
Nanyang Technological University
Chujie Chen
Nanyang Technological University
Junyu Jiang
Nanyang Technological University
Lihua Xie
Professor of Electrical Engineering, Nanyang Technological University
Robust Control · Networked Control · Multi-agent Systems
Wee Peng Tay
Nanyang Technological University
information processing · graph signal processing · graph neural networks · robust machine learning