The Dynamic Prior: Understanding 3D Structures for Casual Dynamic Videos

📅 2025-12-04
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Joint estimation of camera pose, 3D scene geometry, and object motion in dynamic scenes remains challenging: conventional SfM/SLAM methods suffer from dynamic-object interference, while existing learning-based approaches rely heavily on scarce motion segmentation annotations and exhibit poor generalization. This paper introduces Dynamic Prior, a task-agnostic prior that, for the first time, synergistically integrates the semantic reasoning capability of vision-language models (VLMs) with the pixel-accurate segmentation power of SAM2 to enable semantic-aware, adaptive identification of dynamic regions. Dynamic Prior is plug-and-play compatible with standard 3D reconstruction pipelines, jointly enhancing robustness in camera pose optimization, depth estimation, and 4D trajectory inference. Experiments demonstrate state-of-the-art motion segmentation performance on both synthetic and real-world dynamic videos, along with significant improvements in 3D structural recovery accuracy.

πŸ“ Abstract
Estimating accurate camera poses, 3D scene geometry, and object motion from in-the-wild videos is a long-standing challenge for classical structure-from-motion pipelines due to the presence of dynamic objects. Recent learning-based methods attempt to overcome this challenge by training motion estimators to filter dynamic objects and focus on the static background. However, their performance is largely limited by the availability of large-scale motion segmentation datasets, resulting in inaccurate segmentation and, therefore, inferior structural 3D understanding. In this work, we introduce the Dynamic Prior to robustly identify dynamic objects without task-specific training, leveraging the powerful reasoning capabilities of Vision-Language Models (VLMs) and the fine-grained spatial segmentation capacity of SAM2. The Dynamic Prior can be seamlessly integrated into state-of-the-art pipelines for camera pose optimization, depth reconstruction, and 4D trajectory estimation. Extensive experiments on both synthetic and real-world videos demonstrate that the Dynamic Prior not only achieves state-of-the-art performance on motion segmentation, but also significantly improves accuracy and robustness for structural 3D understanding.
Problem

Research questions and friction points this paper is trying to address.

Estimating camera poses and 3D geometry from dynamic videos
Overcoming limitations of motion segmentation datasets for dynamic object identification
Improving accuracy in 3D scene understanding without task-specific training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Vision-Language Models to identify dynamic objects
Integrates SAM2 for fine-grained spatial segmentation
Enhances 3D reconstruction pipelines without task-specific training
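The summary does not spell out how a dynamic-region mask plugs into a reconstruction pipeline, but the general pattern can be sketched as follows. Everything here is a hypothetical stand-in, not the paper's implementation: `query_vlm_for_dynamic_boxes` mocks the VLM reasoning step, `boxes_to_mask` mocks SAM2's pixel-accurate segmentation, and zeroing per-pixel residual weights is one generic way to keep dynamic pixels from influencing camera pose and depth optimization.

```python
import numpy as np

def query_vlm_for_dynamic_boxes(frame):
    """Hypothetical stand-in for the VLM step: in the real pipeline a
    vision-language model reasons about which objects are likely to move.
    Here we simply return one fixed (x0, y0, x1, y1) box."""
    return [(10, 10, 30, 40)]

def boxes_to_mask(boxes, height, width):
    """Stand-in for SAM2: turn coarse boxes into a per-pixel dynamic mask.
    SAM2 would produce tight object masks; we just rasterize the boxes."""
    mask = np.zeros((height, width), dtype=bool)
    for x0, y0, x1, y1 in boxes:
        mask[y0:y1, x0:x1] = True
    return mask

def masked_residual_weights(dynamic_mask):
    """Down-weight dynamic pixels so pose/depth optimization is driven
    by the static background only (0 = dynamic, 1 = static)."""
    return np.where(dynamic_mask, 0.0, 1.0)

frame = np.zeros((64, 64, 3))
boxes = query_vlm_for_dynamic_boxes(frame)
mask = boxes_to_mask(boxes, 64, 64)
weights = masked_residual_weights(mask)
print(mask.sum(), weights[0, 0], weights[20, 20])  # 600 1.0 0.0
```

In a real pipeline these weights would multiply photometric or reprojection residuals inside the optimizer, which is what makes the prior plug-and-play: the downstream solver is unchanged, it just sees zero-weighted dynamic pixels.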
Authors
Zhuoyuan Wu, PKU
Xurui Yang, Independent Researcher
Jiahui Huang, NVIDIA (3D Computer Vision, Graphics)
Yue Wang, USC
Jun Gao, University of Michigan