🤖 AI Summary
Existing single-image-to-3D scene generation methods suffer from geometric distortions and blurry textures, largely because they rely on geometric cues extracted from a single frame, such as monocular depth. GeoWorld addresses this by first synthesizing consecutive video frames with a video diffusion model, then extracting globally consistent, full-frame geometry features from them with a geometry model, and feeding these features back as geometric conditions for generation along a given camera trajectory. Its key contributions are: (1) a geometry alignment loss that enforces structural consistency across multi-frame depth maps under camera-motion constraints; and (2) a lightweight geometry adaptation module that improves the transfer and utilization of cross-frame geometry features. Evaluated on ScanNet and Matterport3D, GeoWorld significantly outperforms state-of-the-art methods in PSNR, LPIPS, and Chamfer Distance, capturing both visual fidelity and geometric accuracy. Qualitative results further confirm that the reconstructed scenes exhibit high geometric precision and photorealistic texture quality.
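The summary's geometry alignment loss — enforcing depth consistency across frames under known camera motion — can be illustrated with a minimal sketch. The paper does not publish its exact formulation, so the function below is a hypothetical variant: it lifts one frame's depth map to 3D, transforms the points into a second camera with the relative pose `(R, t)`, reprojects them, and takes an L1 penalty between the transformed depths and the second frame's depth at the reprojected pixels. All names and the nearest-pixel lookup are illustrative assumptions.

```python
# Hypothetical sketch of a multi-frame geometry alignment loss
# (illustrative only; not the paper's exact formulation).
import numpy as np

def backproject(depth, K):
    """Lift a depth map (H, W) to camera-space 3D points (H*W, 3)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T          # pixel -> unit-depth camera rays
    return rays * depth.reshape(-1, 1)       # scale rays by per-pixel depth

def geometry_alignment_loss(depth_i, depth_j, K, R, t):
    """L1 discrepancy between frame i's depth warped into frame j's
    camera (via relative pose R, t) and frame j's predicted depth."""
    pts_i = backproject(depth_i, K)
    pts_in_j = pts_i @ R.T + t               # move points into frame j's camera
    proj = pts_in_j @ K.T                    # project back to the image plane
    z = proj[:, 2]
    uv = np.round(proj[:, :2] / z[:, None]).astype(int)
    h, w = depth_j.shape
    valid = (z > 0) & (uv[:, 0] >= 0) & (uv[:, 0] < w) \
                    & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    if not valid.any():
        return 0.0                           # no overlap between the views
    d_pred = depth_j[uv[valid, 1], uv[valid, 0]]
    return float(np.abs(z[valid] - d_pred).mean())
```

With identical depth maps and an identity relative pose the loss is zero; translating the camera 0.5 units along the optical axis against an unchanged constant depth map yields a loss of 0.5, matching the intuition that the loss measures metric depth disagreement.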
📝 Abstract
Previous works that leverage video models for image-to-3D scene generation tend to suffer from geometric distortions and blurry content. In this paper, we renovate the image-to-3D scene generation pipeline by unlocking the potential of geometry models and present GeoWorld. Instead of exploiting geometric information obtained from a single-frame input, we first generate consecutive video frames and then use a geometry model to provide full-frame geometry features, which carry richer information than the single-frame depth maps or camera embeddings used in previous methods; these geometry features then serve as geometric conditions for the video generation model. To enhance the consistency of geometric structures, we further propose a geometry alignment loss that imposes real-world geometric constraints on the model and a geometry adaptation module that ensures effective utilization of the geometry features. Extensive experiments show that GeoWorld can generate high-fidelity 3D scenes from a single image and a given camera trajectory, outperforming prior methods both qualitatively and quantitatively. Project Page: https://peaes.github.io/GeoWorld/.
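The geometry adaptation module described above is not specified in detail here, but adapter-style conditioning is commonly implemented as a small projection whose output is added residually to the host model's features, with the output layer zero-initialized so training starts from the unmodified video model (a ControlNet-style choice). The sketch below is a hypothetical minimal version under that assumption; `GeometryAdapter`, its dimensions, and the zero-init scheme are illustrative, not the paper's architecture.

```python
# Illustrative sketch of a lightweight geometry adaptation module
# (an assumption about the design, not the paper's implementation).
import numpy as np

class GeometryAdapter:
    """Projects geometry features and injects them into video-model
    features via a zero-initialized residual branch."""

    def __init__(self, geo_dim, vid_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.w_in = rng.normal(0.0, 0.02, (geo_dim, vid_dim))  # input projection
        self.w_out = np.zeros((vid_dim, vid_dim))              # zero-init output

    def __call__(self, video_feat, geo_feat):
        """video_feat: (N, vid_dim); geo_feat: (N, geo_dim)."""
        h = np.maximum(geo_feat @ self.w_in, 0.0)   # project + ReLU
        return video_feat + h @ self.w_out          # residual injection
```

Because `w_out` starts at zero, the module is an identity map at initialization: the video model's behavior is untouched until training grows the geometry branch, which makes such adapters stable to bolt onto a pretrained backbone.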