🤖 AI Summary
This paper addresses 3D object reconstruction from monocular videos without camera pose annotations. It proposes the first end-to-end method that requires neither synthetic data nor manual pose supervision, built on two key innovations: (1) an implicit pose-invariant feature aggregation mechanism, implemented with a Transformer to enable robust cross-frame feature fusion; and (2) a diffusion-prior-driven pseudo-novel-view synthesis framework that jointly optimizes geometry and appearance in an analysis-by-synthesis paradigm, integrating tri-plane implicit representations with Score Distillation Sampling (SDS). Evaluated on G-Objaverse and CO3D, the method achieves high-fidelity and diverse object reconstructions under zero pose supervision, and it significantly improves generalization to real-world scenes and training scalability compared to prior approaches that rely on explicit pose labels or synthetic data.
📝 Abstract
Large Reconstruction Models (LRMs) have recently become a popular method for creating 3D foundation models. Training 3D reconstruction models with 2D visual data traditionally requires camera-pose annotations for the training samples, a process that is both time-consuming and prone to error. Consequently, 3D reconstruction training has been confined to either synthetic 3D datasets or small-scale datasets with annotated poses. In this study, we investigate the feasibility of 3D reconstruction from unposed video data of various objects. We introduce UVRM, a novel 3D reconstruction model that can be trained and evaluated on monocular videos without requiring any pose information. UVRM uses a transformer network to implicitly aggregate video frames into a pose-invariant latent feature space, which is then decoded into a tri-plane 3D representation. To obviate the need for ground-truth pose annotations during training, UVRM combines the score distillation sampling (SDS) method with an analysis-by-synthesis approach, progressively synthesizing pseudo novel views using a pre-trained diffusion model. We qualitatively and quantitatively evaluate UVRM's performance on the G-Objaverse and CO3D datasets without relying on pose information. Extensive experiments show that UVRM effectively and efficiently reconstructs a wide range of 3D objects from unposed videos.
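The SDS-based pseudo-view supervision described in the abstract can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: `toy_denoiser` is a hypothetical stand-in for the pre-trained diffusion model's noise predictor, and the weighting `w(t)`, noise schedule value `alpha_bar`, and image shapes are illustrative assumptions. The sketch shows the core SDS gradient, w(t) · (ε̂ − ε), computed on a rendered pseudo novel view.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_denoiser(x_t, t):
    # Hypothetical stand-in for a pre-trained diffusion model's noise
    # predictor; UVRM would query a real diffusion prior here.
    return 0.1 * x_t

def sds_gradient(rendered, t, alpha_bar, rng):
    """One Score Distillation Sampling step on a rendered pseudo novel view.

    Follows the standard SDS form: noise the render at timestep t, ask the
    diffusion prior to predict the noise, and return w(t) * (eps_hat - eps).
    In a full system this gradient is back-propagated into the tri-plane
    parameters through the renderer.
    """
    eps = rng.standard_normal(rendered.shape)                 # sampled Gaussian noise
    x_t = np.sqrt(alpha_bar) * rendered + np.sqrt(1.0 - alpha_bar) * eps
    eps_hat = toy_denoiser(x_t, t)                            # prior's noise estimate
    w_t = 1.0 - alpha_bar                                     # a common weighting choice
    return w_t * (eps_hat - eps)

# Toy "rendered view" standing in for a tri-plane render (H x W x 3).
rendered = rng.standard_normal((8, 8, 3))
grad = sds_gradient(rendered, t=500, alpha_bar=0.5, rng=rng)
print(grad.shape)  # gradient has the same shape as the rendered image
```

Note that SDS never needs the ground-truth pose of any training frame: the gradient comes entirely from the diffusion prior's opinion of the rendered pseudo view, which is what lets UVRM train on unposed videos.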