🤖 AI Summary
To address core challenges in dynamic scene reconstruction from monocular RGB video (distortion of thin structures, depth inconsistency, floating artifacts, and motion-geometry incoherence), this paper proposes an object-aware Gaussian rasterization framework. Methodologically, it introduces three novel components: (1) a mask-guided object-level depth loss; (2) skeleton-based sampling with mask-driven re-identification; and (3) virtual-view depth supervision coupled with scaffold-projection modeling, which explicitly enforces consistency between 3D motion nodes and 2D trajectories while suppressing floating artifacts. The technical pipeline integrates video segmentation, epipolar-error-map optimization, and multi-source geometric supervision. On standard benchmarks, the approach consistently outperforms state-of-the-art methods, with significant gains in geometric accuracy, motion coherence, and texture fidelity. According to the authors, it is the first method to enable fully automatic, high-quality dynamic scene reconstruction from casually captured monocular videos.
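As a rough illustration of the first component, below is a minimal PyTorch-style sketch of what a mask-guided object-level depth loss could look like. The function name, the per-object scale-and-shift alignment, and the tensor layouts are assumptions for illustration, not the paper's actual formulation.

```python
import torch

def object_depth_loss(pred_depth, prior_depth, masks, eps=1e-6):
    """Per-object depth loss (hypothetical): compare rendered depth to the
    video-depth prior inside each segmentation mask, so thin structures
    carry as much weight as large background regions.

    pred_depth:  (H, W) depth rendered from the Gaussians
    prior_depth: (H, W) consistent-video-depth prior
    masks:       (K, H, W) boolean object masks from video segmentation
    """
    total, used = pred_depth.new_zeros(()), 0
    for m in masks:
        if m.sum() < 2:  # skip empty or degenerate masks
            continue
        d_pred, d_prior = pred_depth[m], prior_depth[m]
        # Monocular depth is ambiguous up to scale and shift, so align the
        # prior to the rendering inside each object before comparing
        # (an assumed alignment scheme, not necessarily the authors').
        s = (d_pred.std() + eps) / (d_prior.std() + eps)
        t = d_pred.mean() - s * d_prior.mean()
        total = total + (d_pred - (s * d_prior + t)).abs().mean()
        used += 1
    return total / max(used, 1)
```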
📝 Abstract
We introduce a fully automatic pipeline for dynamic scene reconstruction from casually captured monocular RGB videos. Rather than designing a new scene representation, we enhance the priors that drive Dynamic Gaussian Splatting. Video segmentation combined with epipolar-error maps yields object-level masks that closely follow thin structures; these masks (i) guide an object-level depth loss that sharpens the consistent video depth, and (ii) support skeleton-based sampling plus mask-guided re-identification to produce reliable, comprehensive 2-D tracks. Two additional objectives embed the refined priors in the reconstruction stage: a virtual-view depth loss removes floaters, and a scaffold-projection loss ties motion nodes to the tracks, preserving fine geometry and coherent motion. The resulting system surpasses previous monocular dynamic scene reconstruction methods and delivers visibly superior renderings.
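To make the scaffold-projection objective concrete, here is a hedged sketch that projects 3-D motion nodes through a standard pinhole camera and penalizes their distance to the matched 2-D tracks. The interface (`nodes_3d`, `tracks_2d`, `K`, `w2c`, `visible`) and the visibility handling are illustrative assumptions; the abstract does not specify these details.

```python
import torch

def scaffold_projection_loss(nodes_3d, tracks_2d, K, w2c, visible):
    """Hypothetical scaffold-projection loss: tie projected motion nodes
    to their 2-D tracks at one frame.

    nodes_3d:  (N, 3) motion-node positions at frame t (world space)
    tracks_2d: (N, 2) matched 2-D track positions at frame t (pixels)
    K:         (3, 3) camera intrinsics
    w2c:       (4, 4) world-to-camera extrinsics
    visible:   (N,) boolean track-visibility flags for frame t
    """
    # World -> camera -> pixel, using a standard pinhole model.
    pts_h = torch.cat([nodes_3d, torch.ones_like(nodes_3d[:, :1])], dim=1)
    cam = (w2c @ pts_h.T).T[:, :3]
    proj = (K @ cam.T).T
    uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)
    # Only supervise nodes whose 2-D track is visible in this frame.
    err = torch.norm(uv - tracks_2d, dim=-1)
    return err[visible].mean()
```

Summed over frames, a term of this shape would couple the 3-D motion scaffold to the refined 2-D tracks, which is the consistency the abstract describes.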