GeoWorld: Unlocking the Potential of Geometry Models to Facilitate High-Fidelity 3D Scene Generation

📅 2025-11-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing single-image-to-3D scene generation methods suffer from geometric distortions and texture blurriness, largely because they rely on geometric cues estimated from a single input frame. To address this, the paper proposes GeoWorld, a framework that first synthesizes consecutive video frames with a video diffusion model, then uses a geometry model to extract full-frame geometry features from them, and conditions generation on these features to reconstruct a 3D scene along a given camera trajectory. Its key contributions are: (1) a geometry alignment loss enforcing structural consistency across multi-frame depth maps under real-world camera-motion constraints; and (2) a lightweight geometry adaptation module that ensures the geometry features are effectively utilized by the generation model. Evaluated on ScanNet and Matterport3D, GeoWorld significantly outperforms state-of-the-art methods, achieving substantial improvements in PSNR, LPIPS, and Chamfer Distance, which quantify both visual fidelity and geometric accuracy. Qualitative results further confirm that reconstructed scenes exhibit high geometric precision and photorealistic texture quality.

📝 Abstract
Previous works leveraging video models for image-to-3D scene generation tend to suffer from geometric distortions and blurry content. In this paper, we renovate the pipeline of image-to-3D scene generation by unlocking the potential of geometry models and present our GeoWorld. Instead of exploiting geometric information obtained from a single-frame input, we propose to first generate consecutive video frames and then take advantage of the geometry model to provide full-frame geometry features, which contain richer information than single-frame depth maps or camera embeddings used in previous methods, and use these geometry features as geometrical conditions to aid the video generation model. To enhance the consistency of geometric structures, we further propose a geometry alignment loss to provide the model with real-world geometric constraints and a geometry adaptation module to ensure the effective utilization of geometry features. Extensive experiments show that our GeoWorld can generate high-fidelity 3D scenes from a single image and a given camera trajectory, outperforming prior methods both qualitatively and quantitatively. Project Page: https://peaes.github.io/GeoWorld/.
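The pipeline described in the abstract (generate frames, extract full-frame geometry features, then condition generation on them) can be sketched with toy stand-ins. All class and function names below are illustrative placeholders, not the authors' actual code; images are flattened to 1D pixel lists to keep the sketch minimal.

```python
# Hypothetical sketch of the GeoWorld pipeline from the abstract.
# All names are illustrative placeholders, not the paper's implementation.

class VideoModel:
    """Toy stand-in: 'generates' one frame per trajectory step by copying
    the input image, optionally nudged by a geometry condition."""
    def sample(self, image, trajectory, cond=None):
        frames = [list(image) for _ in trajectory]
        if cond is not None:
            frames = [[p + c for p, c in zip(f, cond)] for f in frames]
        return frames

class GeometryModel:
    """Toy stand-in for the geometry model: full-frame geometry features
    as the per-pixel mean over all generated frames."""
    def extract(self, frames):
        n = len(frames)
        return [sum(px) / n for px in zip(*frames)]

def adapter(geo_feats):
    """Toy geometry adaptation module: a simple rescaling."""
    return [0.1 * g for g in geo_feats]

def generate_scene(image, trajectory, video_model, geometry_model):
    frames = video_model.sample(image, trajectory)           # 1. generate consecutive frames
    geo_feats = geometry_model.extract(frames)               # 2. full-frame geometry features
    cond = adapter(geo_feats)                                # 3. adapt features for the model
    return video_model.sample(image, trajectory, cond=cond)  # 4. geometry-conditioned generation
```

The point of the sketch is the data flow: geometry features come from *all* generated frames rather than a single-frame depth map, and they re-enter the video model as a condition.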
Problem

Research questions and friction points this paper is trying to address.

Addresses geometric distortions in image-to-3D scene generation
Enhances geometric consistency using multi-frame geometry features
Improves fidelity of 3D scenes from single images
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates consecutive video frames for geometry extraction
Uses geometry features as conditions for video generation
Introduces geometry alignment loss and adaptation module
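The exact formulation of the geometry alignment loss is not given on this page. A common way to impose such a multi-frame, real-world geometric constraint is depth-reprojection consistency: back-project one frame's depth, carry it into another frame's camera via the known relative pose, and penalize the discrepancy. The sketch below shows that standard formulation as an assumption, not the paper's actual loss; for brevity it compares depths at source-pixel locations rather than resampling onto the target frame's grid.

```python
import numpy as np

# Illustrative multi-frame geometric-consistency loss (an assumed standard
# formulation, NOT the paper's actual geometry alignment loss).

def backproject(depth, K_inv):
    """Lift a depth map (H, W) to camera-space 3D points (H, W, 3)."""
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).astype(float)
    rays = pix @ K_inv.T            # pixel coords -> normalized camera rays
    return rays * depth[..., None]  # scale each ray by its depth

def alignment_loss(depth_i, depth_j, K, T_ij):
    """Mean L1 discrepancy between frame j's predicted depth and frame i's
    depth transformed into frame j's camera by the relative pose T_ij (4x4)."""
    pts_i = backproject(depth_i, np.linalg.inv(K))                        # (H, W, 3)
    pts_h = np.concatenate([pts_i, np.ones_like(pts_i[..., :1])], axis=-1)
    pts_j = pts_h @ T_ij.T                                                # into frame j's camera
    z_in_j = pts_j[..., 2]                                                # expected depth in frame j
    return np.abs(z_in_j - depth_j).mean()
```

Under an identity pose the loss is zero for matching depth maps, and a pure forward translation of the camera shifts the expected depth accordingly; a loss of this shape gives the generator a physically grounded signal tying depth predictions across frames to the camera motion.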
Authors
Yuhao Wan (VCIP, School of Computer Science, Nankai University)
Lijuan Liu (ByteDance Inc.)
Jingzhi Zhou (VCIP, School of Computer Science, Nankai University)
Zihan Zhou (Renmin University of China)
Xuying Zhang (VCIP, School of Computer Science, Nankai University)
Dongbo Zhang (ByteDance Inc.)
Shaohui Jiao (affiliation unknown)
Qibin Hou (Nankai University; interests: deep learning, computer vision, visual attention)
Ming-Ming Cheng (Professor of Computer Science, Nankai University; interests: computer vision, computer graphics, visual attention, saliency)