UAV4D: Dynamic Neural Rendering of Human-Centric UAV Imagery using Gaussian Splatting

📅 2025-06-05
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the challenges of complex backgrounds, small-scale multi-pedestrian motion, and absence of depth sensors in monocular UAV-mounted top-down videos, this paper proposes the first high-fidelity neural rendering framework for multi-pedestrian scenes under UAV settings. Methodologically, we introduce a world-coordinate-scale decoupling strategy based on human–scene contact points—the first of its kind—and jointly initialize 3D Gaussian splatting for both SMPL-based human avatars and background meshes, enabling unified human–scene 3D modeling. Our approach further integrates priors from 3D foundation models, monocular human pose estimation, and dynamic Gaussian optimization. Evaluated on three major UAV benchmarks—including VisDrone—our method achieves a 1.5 dB PSNR gain in novel-view synthesis, with marked improvements in detail sharpness and motion consistency. This work establishes a new paradigm for neural rendering of dynamic scenes without auxiliary depth sensors.

Technology Category

Application Category

📝 Abstract
Despite significant advancements in dynamic neural rendering, existing methods fail to address the unique challenges posed by UAV-captured scenarios, particularly those involving monocular camera setups, top-down perspective, and multiple small, moving humans, which are not adequately represented in existing datasets. In this work, we introduce UAV4D, a framework for enabling photorealistic rendering for dynamic real-world scenes captured by UAVs. Specifically, we address the challenge of reconstructing dynamic scenes with multiple moving pedestrians from monocular video data without the need for additional sensors. We use a combination of a 3D foundation model and a human mesh reconstruction model to reconstruct both the scene background and humans. We propose a novel approach to resolve the scene scale ambiguity and place both humans and the scene in world coordinates by identifying human-scene contact points. Additionally, we exploit the SMPL model and background mesh to initialize Gaussian splats, enabling holistic scene rendering. We evaluated our method on three complex UAV-captured datasets: VisDrone, Manipal-UAV, and Okutama-Action, each with distinct characteristics and 10~50 humans. Our results demonstrate the benefits of our approach over existing methods in novel view synthesis, achieving a 1.5 dB PSNR improvement and superior visual sharpness.
Problem

Research questions and friction points this paper is trying to address.

Rendering dynamic UAV scenes with multiple small humans
Resolving scale ambiguity in monocular UAV footage
Improving novel view synthesis for top-down perspectives
Innovation

Methods, ideas, or system contributions that make the work stand out.

3D foundation model for scene reconstruction
Human-scene contact points for scale resolution
SMPL model for Gaussian splats initialization
🔎 Similar Papers
No similar papers found.