4D3R: Motion-Aware Neural Reconstruction and Rendering of Dynamic Scenes from Monocular Videos

📅 2025-11-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses novel view synthesis from monocular videos of dynamic scenes with unknown camera poses. Methodologically, it proposes a motion-aware neural 4D reconstruction and rendering framework built on a two-stage static/dynamic decomposition paradigm. First, a 3D foundation model initializes scene geometry and camera poses, and a motion-aware bundle adjustment (MA-BA) module then refines the camera trajectory; instance-level dynamic segmentation for this step combines transformer-based learned priors with SAM2. Second, an efficient Motion-Aware Gaussian Splatting (MA-GS) representation models dynamic motion with a control-point-driven MLP deformation field and linear blend skinning. Evaluated on real-world dynamic datasets, the method achieves up to 1.8 dB PSNR improvement and reduces computational overhead by 5× compared with state-of-the-art approaches, excelling particularly in scenes with large dynamic objects.

📝 Abstract
Novel view synthesis from monocular videos of dynamic scenes with unknown camera poses remains a fundamental challenge in computer vision and graphics. While recent advances in 3D representations such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have shown promising results for static scenes, they struggle with dynamic content and typically rely on pre-computed camera poses. We present 4D3R, a pose-free dynamic neural rendering framework that decouples static and dynamic components through a two-stage approach. Our method first leverages 3D foundational models for initial pose and geometry estimation, followed by motion-aware refinement. 4D3R introduces two key technical innovations: (1) a motion-aware bundle adjustment (MA-BA) module that combines transformer-based learned priors with SAM2 for robust dynamic object segmentation, enabling more accurate camera pose refinement; and (2) an efficient Motion-Aware Gaussian Splatting (MA-GS) representation that uses control points with a deformation field MLP and linear blend skinning to model dynamic motion, significantly reducing computational cost while maintaining high-quality reconstruction. Extensive experiments on real-world dynamic datasets demonstrate that our approach achieves up to 1.8dB PSNR improvement over state-of-the-art methods, particularly in challenging scenarios with large dynamic objects, while reducing computational requirements by 5x compared to previous dynamic scene representations.
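The MA-GS representation described above drives Gaussian motion through a sparse set of control points whose per-point transforms are blended by linear blend skinning. A minimal numpy sketch of that blending step follows; the function name, argument shapes, and weight convention are illustrative assumptions, not the paper's actual API:

```python
import numpy as np

def lbs_deform(points, weights, rotations, translations):
    """Deform canonical Gaussian centers via linear blend skinning
    over K control points (hypothetical interface, for illustration).

    points:       (N, 3) canonical positions
    weights:      (N, K) skinning weights (each row sums to 1)
    rotations:    (K, 3, 3) per-control-point rotation matrices
    translations: (K, 3)    per-control-point translations
    """
    # Apply every control-point transform to every point: (K, N, 3)
    transformed = np.einsum('kij,nj->kni', rotations, points) \
                  + translations[:, None, :]
    # Blend the K candidate positions with the skinning weights: (N, 3)
    return np.einsum('nk,kni->ni', weights, transformed)
```

With identity rotations and zero translations the points are unchanged, and a shared translation moves all points rigidly; in the paper's setting the per-control-point transforms would come from the deformation-field MLP at each timestep.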
Problem

Research questions and friction points this paper is trying to address.

Reconstructing dynamic scenes from monocular videos with unknown camera poses
Separating static and dynamic components in neural scene representations
Reducing computational costs while maintaining high-quality dynamic reconstruction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Motion-aware bundle adjustment for pose refinement
Motion-Aware Gaussian Splatting for dynamic modeling
Two-stage static-dynamic decoupling framework
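The motion-aware bundle adjustment listed above refines camera poses while discounting pixels that belong to segmented dynamic objects. The paper's exact loss is not reproduced here; as a hypothetical sketch of the masking idea, a BA residual can simply be averaged over static pixels only (names and shapes are assumptions for illustration):

```python
import numpy as np

def masked_ba_loss(residuals, dynamic_mask):
    """Toy motion-aware BA objective: average squared reprojection
    error over static pixels, ignoring segmented dynamic regions.

    residuals:    (H, W) per-pixel reprojection error
    dynamic_mask: (H, W) boolean, True where SAM2-style segmentation
                  flagged a dynamic object
    """
    static = ~dynamic_mask
    # Guard against an all-dynamic frame (empty static set)
    return float((residuals[static] ** 2).sum() / max(static.sum(), 1))
```

In a full system this scalar would be minimized over camera parameters; the point of the mask is that moving objects no longer bias the pose estimate toward explaining their motion as camera motion.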