The Dynamic Prior: Understanding 3D Structures for Casual Dynamic Videos

📅 2025-12-04
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Joint estimation of camera pose, 3D scene geometry, and object motion in dynamic scenes remains challenging: conventional SfM/SLAM methods suffer from dynamic-object interference, while existing learning-based approaches rely heavily on scarce motion segmentation annotations and exhibit poor generalization. This paper introduces Dynamic Prior, a task-agnostic prior that, for the first time, synergistically integrates the semantic reasoning capability of vision-language models (VLMs) with the pixel-accurate segmentation power of SAM2 to enable semantic-aware, adaptive identification of dynamic regions. Dynamic Prior is plug-and-play compatible with standard 3D reconstruction pipelines, jointly enhancing robustness in camera pose optimization, depth estimation, and 4D trajectory inference. Experiments demonstrate state-of-the-art motion segmentation performance on both synthetic and real-world dynamic videos, along with significant improvements in 3D structural recovery accuracy.

πŸ“ Abstract
Estimating accurate camera poses, 3D scene geometry, and object motion from in-the-wild videos is a long-standing challenge for classical structure-from-motion pipelines due to the presence of dynamic objects. Recent learning-based methods attempt to overcome this challenge by training motion estimators to filter dynamic objects and focus on the static background. However, their performance is largely limited by the availability of large-scale motion segmentation datasets, resulting in inaccurate segmentation and, therefore, inferior structural 3D understanding. In this work, we introduce the Dynamic Prior to robustly identify dynamic objects without task-specific training, leveraging the powerful reasoning capabilities of Vision-Language Models (VLMs) and the fine-grained spatial segmentation capacity of SAM2. The Dynamic Prior can be seamlessly integrated into state-of-the-art pipelines for camera pose optimization, depth reconstruction, and 4D trajectory estimation. Extensive experiments on both synthetic and real-world videos demonstrate that the Dynamic Prior not only achieves state-of-the-art performance on motion segmentation, but also significantly improves accuracy and robustness for structural 3D understanding.
Problem

Research questions and friction points this paper is trying to address.

Estimating camera poses and 3D geometry from dynamic videos
Overcoming limitations of motion segmentation datasets for dynamic object identification
Improving accuracy in 3D scene understanding without task-specific training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Vision-Language Models to identify dynamic objects
Integrates SAM2 for fine-grained spatial segmentation
Enhances 3D reconstruction pipelines without task-specific training
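The summary does not spell out how a dynamic-region mask plugs into a reconstruction pipeline, but the general pattern can be sketched as follows. Everything here is a hypothetical stand-in, not the paper's implementation: `query_vlm_for_dynamic_boxes` mocks the VLM reasoning step, `boxes_to_mask` mocks SAM2's pixel-accurate segmentation, and zeroing per-pixel residual weights is one generic way to keep dynamic pixels from influencing camera pose and depth optimization.

```python
import numpy as np

def query_vlm_for_dynamic_boxes(frame):
    """Hypothetical stand-in for the VLM step: in the real pipeline a
    vision-language model reasons about which objects are likely to move.
    Here we simply return one fixed (x0, y0, x1, y1) box."""
    return [(10, 10, 30, 40)]

def boxes_to_mask(boxes, height, width):
    """Stand-in for SAM2: turn coarse boxes into a per-pixel dynamic mask.
    SAM2 would produce tight object masks; we just rasterize the boxes."""
    mask = np.zeros((height, width), dtype=bool)
    for x0, y0, x1, y1 in boxes:
        mask[y0:y1, x0:x1] = True
    return mask

def masked_residual_weights(dynamic_mask):
    """Down-weight dynamic pixels so pose/depth optimization is driven
    by the static background only (0 = dynamic, 1 = static)."""
    return np.where(dynamic_mask, 0.0, 1.0)

frame = np.zeros((64, 64, 3))
boxes = query_vlm_for_dynamic_boxes(frame)
mask = boxes_to_mask(boxes, 64, 64)
weights = masked_residual_weights(mask)
print(mask.sum(), weights[0, 0], weights[20, 20])  # 600 1.0 0.0
```

In a real pipeline these weights would multiply photometric or reprojection residuals inside the optimizer, which is what makes the prior plug-and-play: the downstream solver is unchanged, it just sees zero-weighted dynamic pixels.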
Authors
Zhuoyuan Wu, PKU
Xurui Yang, Independent Researcher
Jiahui Huang, NVIDIA (3D Computer Vision, Graphics)
Yue Wang, USC
Jun Gao, University of Michigan