UAV-MM3D: A Large-Scale Synthetic Benchmark for 3D Perception of Unmanned Aerial Vehicles with Multi-Modal Data

📅 2025-11-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Low-altitude UAV 3D perception faces critical challenges: real-world data is scarce, multimodal sensor annotation is costly, and cross-modal alignment is difficult. To address these, this paper introduces UAV-MM3D, the first multimodal synthetic benchmark tailored for low-altitude UAVs. Leveraging high-fidelity physics-based simulation, UAV-MM3D synchronously generates RGB, infrared, LiDAR, radar, and event-camera data across diverse scenes and weather conditions, yielding a large-scale synthetic dataset of 400K frames with 2D/3D bounding boxes, 6-DoF pose annotations, and instance-level labels. The authors further propose LGFusionNet, a LiDAR-guided multimodal fusion network, and a trajectory prediction model, significantly reducing reliance on real-world data. The benchmark and baseline models establish a standardized evaluation platform for 3D detection, 6-DoF pose estimation, and trajectory prediction, improving algorithm generalizability and reproducibility.
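
To make the annotation structure concrete, here is a minimal sketch of what one synchronized frame might look like as a Python record. The field names and array shapes are illustrative assumptions, not the dataset's actual schema, which is not specified in this summary.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class UAVFrame:
    """Hypothetical per-frame record for UAV-MM3D; all field names
    and shapes are assumptions, not the released file format."""
    rgb: np.ndarray        # (H, W, 3) uint8 color image
    ir: np.ndarray         # (H, W) infrared/thermal image
    lidar: np.ndarray      # (N, 4) points: x, y, z, intensity
    radar: np.ndarray      # (M, 5) returns: x, y, z, radial velocity, RCS
    dvs: np.ndarray        # (K, 4) events: x, y, timestamp, polarity
    boxes_2d: np.ndarray   # (T, 4) image boxes: x1, y1, x2, y2
    boxes_3d: np.ndarray   # (T, 7) center (x, y, z), size (l, w, h), yaw
    poses_6dof: np.ndarray # (T, 6) translation (x, y, z) + roll, pitch, yaw
    instance_ids: np.ndarray  # (T,) track IDs, stable across frames
```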

📝 Abstract
Accurate perception of UAVs in complex low-altitude environments is critical for airspace security and related intelligent systems. Developing reliable solutions requires large-scale, accurately annotated, and multimodal data. However, real-world UAV data collection faces inherent constraints due to airspace regulations, privacy concerns, and environmental variability, while manual annotation of 3D poses and cross-modal correspondences is time-consuming and costly. To overcome these challenges, we introduce UAV-MM3D, a high-fidelity multimodal synthetic dataset for low-altitude UAV perception and motion understanding. It comprises 400K synchronized frames across diverse scenes (urban areas, suburbs, forests, coastal regions) and weather conditions (clear, cloudy, rainy, foggy), featuring multiple UAV models (micro, small, medium-sized) and five modalities: RGB, IR, LiDAR, Radar, and DVS (Dynamic Vision Sensor). Each frame provides 2D/3D bounding boxes, 6-DoF poses, and instance-level annotations, enabling core UAV tasks such as 3D detection, pose estimation, target tracking, and short-term trajectory forecasting. We further propose LGFusionNet, a LiDAR-guided multimodal fusion baseline, and a dedicated UAV trajectory prediction baseline to facilitate benchmarking. With its controllable simulation environment, comprehensive scenario coverage, and rich annotations, UAV-MM3D offers a public benchmark for advancing 3D perception of UAVs.
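
Since every frame carries 6-DoF pose labels, a small worked example helps clarify what such an annotation encodes. The sketch below converts a translation plus Euler angles into a homogeneous transform; the ZYX (yaw-pitch-roll) convention is an assumption, as the abstract does not state which convention the dataset uses.

```python
import numpy as np

def pose_to_matrix(tx, ty, tz, roll, pitch, yaw):
    """Convert a 6-DoF pose (translation + ZYX Euler angles, in radians)
    into a 4x4 homogeneous body-to-world transform."""
    cr, sr = np.cos(roll), np.sin(roll)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    # Rotation composed as R = Rz(yaw) @ Ry(pitch) @ Rx(roll)
    R = np.array([
        [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr],
        [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr],
        [-sp,     cp * sr,                cp * cr],
    ])
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = [tx, ty, tz]
    return T
```

With this matrix, a UAV's body-frame geometry can be placed in the world frame, which is what 3D detection and pose-estimation metrics ultimately compare against.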
Problem

Research questions and friction points this paper is trying to address.

Lack of large-scale multimodal data for UAV perception
High cost and difficulty in real-world UAV data collection
Need for benchmarks for 3D detection and trajectory forecasting
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthetic multimodal dataset with 400K synchronized frames
LiDAR-guided fusion baseline for UAV perception tasks (an illustrative sketch follows this list)
Controllable simulation covering diverse scenes and weather conditions
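
As a rough illustration of what "LiDAR-guided" fusion can mean, the sketch below projects LiDAR points into an image feature map and fuses the sampled pixel features with per-point features. This is a generic pattern, not the paper's actual LGFusionNet, whose architecture is not detailed here; all module names, feature dimensions, and the pinhole-projection step are assumptions.

```python
import torch
import torch.nn as nn

class LiDARGuidedFusion(nn.Module):
    """Illustrative LiDAR-guided fusion sketch (NOT the paper's
    LGFusionNet): LiDAR points query image features via projection."""

    def __init__(self, img_dim=64, pt_dim=16, out_dim=128):
        super().__init__()
        self.point_mlp = nn.Linear(4, pt_dim)           # x, y, z, intensity
        self.fuse = nn.Linear(img_dim + pt_dim, out_dim)

    def forward(self, img_feat, points, K):
        # img_feat: (C, H, W) image features; points: (N, 4); K: (3, 3) intrinsics
        C, H, W = img_feat.shape
        uvw = (K @ points[:, :3].T).T                   # perspective projection
        uv = uvw[:, :2] / uvw[:, 2:3].clamp(min=1e-6)   # normalize by depth
        u = uv[:, 0].round().long().clamp(0, W - 1)
        v = uv[:, 1].round().long().clamp(0, H - 1)
        sampled = img_feat[:, v, u].T                   # (N, C) pixel features
        pt_feat = self.point_mlp(points)                # (N, pt_dim)
        return self.fuse(torch.cat([sampled, pt_feat], dim=1))  # (N, out_dim)
```

In this pattern the LiDAR geometry acts as the query: only pixels actually struck by 3D points contribute, which keeps the fusion sparse and geometrically aligned across modalities.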
👥 Authors
Longkun Zou (Pengcheng Laboratory)
Jiale Wang (HKUST, BUPT)
Rongqin Liang (Pengcheng Laboratory)
Hai Wu (The University of Hong Kong)
Ke Chen (Pengcheng Laboratory)
Yaowei Wang (The Hong Kong Polytechnic University)