🤖 AI Summary
This paper addresses the challenge of jointly modeling dynamic foregrounds (e.g., moving humans) and complex backgrounds in real-world UAV scenarios, a difficulty that otherwise yields low-fidelity digital twins. To this end, we propose a high-fidelity neural digital twin framework tailored for UAV perception. For the first time, we introduce 3D Gaussian Splatting (3DGS) into UAV-view modeling, integrating controllable human generation, neural rendering, and mask refinement modules to mitigate reconstruction artifacts caused by dynamic object deformation and appearance variation. Our data augmentation framework enables efficient training of downstream models on synthetic data. Experiments demonstrate a 1.23 dB improvement in reconstruction PSNR and a 2.5–13.7% gain in human detection mAP under UAV viewpoints, significantly enhancing robustness for multi-dynamic-object perception.
📝 Abstract
We present UAVTwin, a method for creating digital twins from real-world environments and facilitating data augmentation for training downstream models embedded in unmanned aerial vehicles (UAVs). Specifically, our approach focuses on synthesizing foreground components, such as various human instances in motion within complex scene backgrounds, from UAV perspectives. This is achieved by integrating 3D Gaussian Splatting (3DGS) for reconstructing backgrounds along with controllable synthetic human models that display diverse appearances and actions in multiple poses. To the best of our knowledge, UAVTwin is the first approach for UAV-based perception that is capable of generating high-fidelity digital twins based on 3DGS. The proposed work significantly enhances downstream models through data augmentation for real-world environments with multiple dynamic objects and significant appearance variations, both of which typically introduce artifacts in 3DGS-based modeling. To tackle these challenges, we propose a novel appearance modeling strategy and a mask refinement module to enhance the training of 3DGS. We demonstrate the high quality of neural rendering by achieving a 1.23 dB improvement in PSNR compared to recent methods. Furthermore, we validate the effectiveness of data augmentation by showing a 2.5% to 13.7% improvement in mAP for the human detection task.
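For readers unfamiliar with the reconstruction metric cited above, the following is a minimal sketch of how PSNR (peak signal-to-noise ratio) is computed between a ground-truth image and a rendered one; the `psnr` helper and the toy arrays are illustrative, not part of the paper's pipeline.

```python
import numpy as np

def psnr(reference, rendered, max_val=1.0):
    """Peak signal-to-noise ratio in dB; higher means a closer match."""
    mse = np.mean((reference.astype(np.float64) - rendered.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10((max_val ** 2) / mse)

# Toy example: a flat image with a uniform error of 0.1 per pixel.
ref = np.full((4, 4), 0.5)
noisy = ref + 0.1  # MSE = 0.01, so PSNR = 10 * log10(1 / 0.01) = 20 dB
print(round(psnr(ref, noisy), 2))  # 20.0
```

Because PSNR is logarithmic in the mean squared error, the reported 1.23 dB gain corresponds to roughly a 25% reduction in MSE relative to the compared methods.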