🤖 AI Summary
This work addresses key limitations of unsupervised single-image 3D reconstruction: reliance on multi-view supervision, high computational cost, and weak geometric constraints. The authors propose a diffusion-model-based monocular geometric prior framework with three core innovations: (1) the “Analysis by Augmentation” paradigm, which reveals and explicitly leverages the implicit monocular shape priors encoded in pre-trained diffusion models; (2) conformal map optimization as a memory-efficient alternative to volumetric representations; and (3) differentiable rendering that aligns a virtual texture with the monocular depth cues in the input image, enabling end-to-end unsupervised optimization. Experiments demonstrate substantial improvements in geometric fidelity and reconstruction efficiency without multi-view supervision, validating that generative models encode extractable and transferable monocular structural priors.
📝 Abstract
DreamFusion established a new paradigm for unsupervised 3D reconstruction from virtual views by combining advances in generative models and differentiable rendering. However, the underlying multi-view rendering, along with supervision from large-scale generative models, is computationally expensive and under-constrained. We propose DreamTexture, a novel Shape-from-Virtual-Texture approach that leverages monocular depth cues to reconstruct 3D objects. Our method textures an input image by aligning a virtual texture with the real depth cues in the input, exploiting the inherent understanding of monocular geometry encoded in modern diffusion models. We then reconstruct depth from the virtual texture deformation with a new conformal map optimization, which avoids the need for memory-intensive volumetric representations. Our experiments reveal that generative models possess an understanding of monocular shape cues, which can be extracted by augmenting and aligning texture cues -- a novel monocular reconstruction paradigm that we call Analysis by Augmentation.
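The shape-from-texture cue underlying this approach can be illustrated with a toy 1D example (this is an assumption-laden sketch of the classical cue, not the paper's actual conformal-map pipeline): a regular texture painted on a slanted plane appears foreshortened under orthographic projection, and inverting that foreshortening recovers the slant.

```python
import numpy as np

# Toy shape-from-texture illustration (hypothetical; not the paper's method):
# stripes with a known period on a slanted plane project with a shorter
# period, compressed by cos(slant). Observing the projected period lets us
# invert the deformation and recover the slant.

true_slant = np.deg2rad(40.0)        # ground-truth slant, unknown to the solver
period_on_surface = 10.0             # stripe period painted on the surface

# Orthographic foreshortening compresses the period by cos(slant).
observed_period = period_on_surface * np.cos(true_slant)

# Invert the texture-deformation cue to estimate the slant.
recovered_slant = np.arccos(observed_period / period_on_surface)

assert np.isclose(recovered_slant, true_slant)
```

DreamTexture generalizes this idea: instead of assuming a known painted texture, it lets a diffusion model synthesize a plausibly deformed virtual texture on the input image, then reads depth back out of that deformation.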