🤖 AI Summary
This paper addresses 3D object reconstruction from monocular videos without camera pose annotations. It proposes the first end-to-end method that requires neither synthetic data nor manual pose supervision, built on two key innovations: (1) an implicit pose-invariant feature aggregation mechanism, implemented with a Transformer to enable robust cross-frame feature fusion; and (2) a diffusion-prior-driven pseudo-novel-view synthesis framework that jointly optimizes geometry and appearance in an analysis-by-synthesis paradigm, integrating tri-plane implicit representations with Score Distillation Sampling (SDS). Evaluated on G-Objaverse and CO3D, the method achieves high-fidelity and diverse object reconstructions under zero pose supervision, and it significantly improves generalization to real-world scenes and training scalability compared to prior approaches that rely on explicit pose labels or synthetic data.
📝 Abstract
Large Reconstruction Models (LRMs) have recently become a popular method for creating 3D foundation models. Training 3D reconstruction models with 2D visual data traditionally requires camera-pose annotations for the training samples, a process that is both time-consuming and prone to error. Consequently, 3D reconstruction training has been confined to either synthetic 3D datasets or small-scale datasets with annotated poses. In this study, we investigate the feasibility of 3D reconstruction from unposed video data of various objects. We introduce UVRM, a novel 3D reconstruction model that can be trained and evaluated on monocular videos without requiring any pose information. UVRM uses a transformer network to implicitly aggregate video frames into a pose-invariant latent feature space, which is then decoded into a tri-plane 3D representation. To obviate the need for ground-truth pose annotations during training, UVRM combines the score distillation sampling (SDS) method with an analysis-by-synthesis approach, progressively synthesizing pseudo novel views using a pre-trained diffusion model. We qualitatively and quantitatively evaluate UVRM's performance on the G-Objaverse and CO3D datasets without relying on pose information. Extensive experiments show that UVRM effectively and efficiently reconstructs a wide range of 3D objects from unposed videos.
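The SDS-based pseudo-view supervision described in the abstract can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: `toy_denoiser` is a hypothetical stand-in for the pre-trained diffusion model's noise predictor, and the weighting `w(t)`, noise schedule value `alpha_bar`, and image shapes are illustrative assumptions. The sketch shows the core SDS gradient, w(t) · (ε̂ − ε), computed on a rendered pseudo novel view.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_denoiser(x_t, t):
    # Hypothetical stand-in for a pre-trained diffusion model's noise
    # predictor; UVRM would query a real diffusion prior here.
    return 0.1 * x_t

def sds_gradient(rendered, t, alpha_bar, rng):
    """One Score Distillation Sampling step on a rendered pseudo novel view.

    Follows the standard SDS form: noise the render at timestep t, ask the
    diffusion prior to predict the noise, and return w(t) * (eps_hat - eps).
    In a full system this gradient is back-propagated into the tri-plane
    parameters through the renderer.
    """
    eps = rng.standard_normal(rendered.shape)                 # sampled Gaussian noise
    x_t = np.sqrt(alpha_bar) * rendered + np.sqrt(1.0 - alpha_bar) * eps
    eps_hat = toy_denoiser(x_t, t)                            # prior's noise estimate
    w_t = 1.0 - alpha_bar                                     # a common weighting choice
    return w_t * (eps_hat - eps)

# Toy "rendered view" standing in for a tri-plane render (H x W x 3).
rendered = rng.standard_normal((8, 8, 3))
grad = sds_gradient(rendered, t=500, alpha_bar=0.5, rng=rng)
print(grad.shape)  # gradient has the same shape as the rendered image
```

Note that SDS never needs the ground-truth pose of any training frame: the gradient comes entirely from the diffusion prior's opinion of the rendered pseudo view, which is what lets UVRM train on unposed videos.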