🤖 AI Summary
This work tackles dynamic 3D reconstruction of deformable objects from unstructured monocular videos without camera-pose annotations, a setting where severe non-rigid deformation, large-scale camera motion, and sparse viewpoint coverage cause conventional methods to fail. The authors propose the first pose-agnostic, category-agnostic framework for articulated 3D reconstruction. The method combines generative 3D priors with differentiable rendering and introduces an object-centric, personalized pose estimator. Optimization is driven jointly by supervision from a pre-trained image-to-3D model, long-term 2D point-trajectory regularization, and a deformable 3D Gaussian representation. Extensive evaluation across diverse dynamic scenes demonstrates strong robustness and generalization, with qualitative and quantitative results that significantly outperform state-of-the-art approaches to articulated reconstruction from unposed monocular video.
📝 Abstract
We present PAD3R, a method for reconstructing deformable 3D objects from casually captured, unposed monocular videos. Unlike existing approaches, PAD3R handles long video sequences featuring substantial object deformation, large-scale camera movement, and limited view coverage that typically challenge conventional systems. At its core, our approach trains a personalized, object-centric pose estimator, supervised by a pre-trained image-to-3D model, which guides the optimization of a deformable 3D Gaussian representation. The optimization is further regularized by long-term 2D point tracking over the entire input video. By combining generative priors and differentiable rendering, PAD3R reconstructs high-fidelity, articulated 3D representations of objects in a category-agnostic way. Extensive qualitative and quantitative results show that PAD3R is robust and generalizes well across challenging scenarios, highlighting its potential for dynamic scene understanding and 3D content creation.
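The abstract describes optimization driven by three signals: a rendering loss against the input frames, supervision from a pre-trained image-to-3D model, and a long-term 2D point-tracking regularizer. A minimal sketch of how such a joint objective might be composed is shown below; all function names, loss forms, and weights are illustrative assumptions, not PAD3R's actual implementation.

```python
import numpy as np

# Hypothetical sketch of a joint objective with three terms, as suggested
# by the abstract. The loss forms and weights are assumptions for
# illustration only.

def rendering_loss(rendered, frame):
    # Photometric L2 between the rendered Gaussians and the input frame.
    return float(np.mean((rendered - frame) ** 2))

def prior_loss(rendered, prior_view):
    # Agreement with a view produced by the pre-trained image-to-3D prior.
    return float(np.mean((rendered - prior_view) ** 2))

def track_loss(projected_pts, tracked_pts):
    # Long-term 2D point-trajectory regularization: projected Gaussian
    # centers should follow the tracked 2D trajectories across the video.
    return float(np.mean(np.linalg.norm(projected_pts - tracked_pts, axis=-1)))

def total_loss(rendered, frame, prior_view, projected_pts, tracked_pts,
               w_render=1.0, w_prior=0.5, w_track=0.1):
    # Weighted sum of the three terms; the actual weighting in PAD3R is
    # not specified in the abstract.
    return (w_render * rendering_loss(rendered, frame)
            + w_prior * prior_loss(rendered, prior_view)
            + w_track * track_loss(projected_pts, tracked_pts))
```

In a full system, each term would be differentiable and backpropagated to the Gaussian parameters, the deformation field, and the personalized pose estimator; this sketch only illustrates how the supervision signals combine.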