Motion4D: Learning 3D-Consistent Motion and Semantics for 4D Scene Understanding

📅 2025-12-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current 2D vision foundation models generalize well to dynamic scene understanding from monocular video but lack 3D geometric consistency, producing spatial misalignment and temporal flickering that hinder joint modeling of geometry and motion in 4D scenes. To address this, we propose Motion4D, a framework that integrates priors from 2D foundation models into a unified 4D Gaussian splatting representation for geometrically accurate and spatiotemporally consistent dynamic scene reconstruction. Methodologically, we design a two-part iterative optimization: sequential optimization, which updates the motion and semantic fields in consecutive stages for local consistency, and global optimization, which jointly refines all attributes for long-term coherence. Motion accuracy is further improved by a 3D confidence map that dynamically adjusts the motion priors and by adaptive Gaussian resampling in under-represented regions; semantic coherence is improved by SAM2-prompted iterative refinement. Evaluated on point tracking, video object segmentation, and novel view synthesis, the approach significantly outperforms state-of-the-art 2D and 3D methods, delivering high-fidelity, low-flicker 4D reconstructions. To our knowledge, this is the first method to achieve simultaneous motion, semantic, and geometry consistency directly from monocular video.
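
As a rough illustration of how a 3D confidence map can gate 2D motion priors, here is a minimal PyTorch sketch. The function name, tensor shapes, and loss form are assumptions for exposition, not the paper's exact objective.

```python
import torch

def motion_prior_loss(pred_xy: torch.Tensor,
                      track_xy: torch.Tensor,
                      conf: torch.Tensor) -> torch.Tensor:
    """Weight 2D tracking supervision by a per-point confidence (illustrative).

    pred_xy:  (N, 2) current projections of tracked Gaussians at frame t
    track_xy: (N, 2) 2D targets from an off-the-shelf point tracker
    conf:     (N,)   confidence in [0, 1]; low values mute unreliable priors
    """
    err = (pred_xy - track_xy).norm(dim=-1)  # per-point reprojection error
    return (conf * err).mean()               # confident points dominate the loss
```

The idea is simply that noisy foundation-model tracks should not be trusted uniformly; how the confidence itself is estimated in 3D is not specified in this summary.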

📝 Abstract
Recent advancements in foundation models for 2D vision have substantially improved the analysis of dynamic scenes from monocular videos. However, despite their strong generalization capabilities, these models often lack 3D consistency, a fundamental requirement for understanding scene geometry and motion, thereby causing severe spatial misalignment and temporal flickering in complex 3D environments. In this paper, we present Motion4D, a novel framework that addresses these challenges by integrating 2D priors from foundation models into a unified 4D Gaussian Splatting representation. Our method features a two-part iterative optimization framework: 1) Sequential optimization, which updates motion and semantic fields in consecutive stages to maintain local consistency, and 2) Global optimization, which jointly refines all attributes for long-term coherence. To enhance motion accuracy, we introduce a 3D confidence map that dynamically adjusts the motion priors, and an adaptive resampling process that inserts new Gaussians into under-represented regions based on per-pixel RGB and semantic errors. Furthermore, we enhance semantic coherence through an iterative refinement process that resolves semantic inconsistencies by alternately optimizing the semantic fields and updating the prompts of SAM2. Extensive evaluations demonstrate that Motion4D significantly outperforms both 2D foundation models and existing 3D-based approaches across diverse scene understanding tasks, including point-based tracking, video object segmentation, and novel view synthesis. Our code is available at https://hrzhou2.github.io/motion4d-web/.
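
To make the adaptive resampling step concrete, the following Python sketch selects under-represented pixels from per-pixel RGB and semantic error maps. The weighting `w_sem`, the top-fraction heuristic, and the helper name are illustrative assumptions; the paper's actual criterion and the subsequent Gaussian insertion are not detailed in this summary.

```python
import torch

def select_resample_pixels(rgb_err: torch.Tensor,
                           sem_err: torch.Tensor,
                           w_sem: float = 0.5,
                           top_frac: float = 0.01) -> torch.Tensor:
    """Pick the worst-reconstructed pixels as candidate sites for new Gaussians.

    rgb_err, sem_err: (H, W) per-pixel error maps of the current render.
    Returns (k, 2) integer (x, y) pixel coordinates.
    """
    err = rgb_err + w_sem * sem_err        # combined photometric + semantic error
    k = max(1, int(top_frac * err.numel()))
    idx = err.flatten().topk(k).indices    # indices of the k largest errors
    ys, xs = idx // err.shape[1], idx % err.shape[1]
    return torch.stack([xs, ys], dim=-1)
```

New Gaussians would then be spawned at these pixels, e.g. by unprojecting them with the current depth estimate, so that under-represented regions gain capacity.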
Problem

Research questions and friction points this paper is trying to address.

2D vision foundation models lack the 3D consistency needed for scene geometry and motion, causing spatial misalignment and temporal flickering
2D priors must be lifted into a unified 4D Gaussian Splatting representation to become 3D-consistent
Scene understanding tasks such as tracking, segmentation, and view synthesis need spatiotemporally coherent motion and semantics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates 2D foundation-model priors into a unified 4D Gaussian Splatting representation
Uses a two-part iterative optimization with sequential and global stages (see the sketch after this list)
Employs a 3D confidence map and adaptive Gaussian resampling to improve motion accuracy
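
A minimal sketch of the two-part schedule named above, assuming placeholder per-field objectives: `optimize_motion`, `optimize_semantics`, and `optimize_all` are hypothetical stand-ins, and the real step counts and losses are not given in this summary.

```python
# Placeholder objectives; real losses and schedules are not specified here.
def optimize_motion(scene, video): ...     # fit motion field, semantics frozen
def optimize_semantics(scene, video): ...  # fit semantic field, motion frozen
def optimize_all(scene, video): ...        # joint refinement of all attributes

def train(scene, video, n_rounds: int = 5):
    # 1) Sequential stage: alternate per-field updates for local consistency.
    for _ in range(n_rounds):
        optimize_motion(scene, video)
        optimize_semantics(scene, video)
    # 2) Global stage: jointly refine all attributes for long-term coherence.
    optimize_all(scene, video)
```

Freezing the other fields during each sequential update keeps the two objectives from fighting each other frame-to-frame, while the final joint pass enforces long-range coherence.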