GeoDiff3D: Self-Supervised 3D Scene Generation with Geometry-Constrained 2D Diffusion Guidance

📅 2026-01-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing 3D scene generation methods often suffer from geometric inconsistencies, structural artifacts, and detail degradation due to limited structural modeling capacity and heavy reliance on large-scale annotated data. This work proposes a self-supervised 3D scene generation framework that leverages coarse geometry as a structural anchor and integrates a geometry-constrained 2D diffusion model to produce texture-rich reference images. By employing voxel-aligned 3D feature aggregation and a dual self-supervision mechanism, the method achieves high-quality scene synthesis without requiring strict multi-view consistency. The approach substantially reduces dependence on labeled data, demonstrates robustness against noise and inconsistencies introduced by diffusion models, and exhibits superior generalization, detail fidelity, and computational efficiency in complex scenes.

📝 Abstract
3D scene generation is a core technology for gaming, film/VFX, and VR/AR. Growing demand for rapid iteration, high-fidelity detail, and accessible content creation has further increased interest in this area. Existing methods broadly follow two paradigms (indirect 2D-to-3D reconstruction and direct 3D generation), but both are limited by weak structural modeling and heavy reliance on large-scale ground-truth supervision, often producing structural artifacts, geometric inconsistencies, and degraded high-frequency details in complex scenes. We propose GeoDiff3D, an efficient self-supervised framework that uses coarse geometry as a structural anchor and a geometry-constrained 2D diffusion model to provide texture-rich reference images. Importantly, GeoDiff3D does not require strict multi-view consistency of the diffusion-generated references and remains robust to the resulting noisy, inconsistent guidance. We further introduce voxel-aligned 3D feature aggregation and dual self-supervision to maintain scene coherence and fine details while substantially reducing dependence on labeled data. GeoDiff3D also trains with low computational cost and enables fast, high-quality 3D scene generation. Extensive experiments on challenging scenes show improved generalization and generation quality over existing baselines, offering a practical solution for accessible and efficient 3D scene construction.
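The paper does not publish implementation details here, but the "voxel-aligned 3D feature aggregation" idea (lifting features from 2D reference images onto a 3D voxel grid by projection) can be sketched roughly as follows. This is a minimal, hypothetical illustration, not the authors' code: the function name, the pinhole-projection setup, and the simple visibility-averaged pooling are all assumptions.

```python
import numpy as np

def voxel_aligned_aggregation(voxel_centers, feature_maps, cameras):
    """Aggregate per-view 2D features onto 3D voxels by projection.

    voxel_centers: (N, 3) world-space voxel centers.
    feature_maps:  list of (H, W, C) feature maps, one per reference view.
    cameras:       list of (3, 4) projection matrices (world -> homogeneous pixels).
    Returns (N, C) features, averaged over the views in which each voxel projects
    inside the image with positive depth.
    """
    n = voxel_centers.shape[0]
    c = feature_maps[0].shape[-1]
    accum = np.zeros((n, c))
    counts = np.zeros((n, 1))
    homog = np.concatenate([voxel_centers, np.ones((n, 1))], axis=1)  # (N, 4)
    for feats, proj_mat in zip(feature_maps, cameras):
        h, w, _ = feats.shape
        proj = homog @ proj_mat.T            # (N, 3) homogeneous pixel coords
        z = proj[:, 2:3]
        uv = proj[:, :2] / np.clip(z, 1e-6, None)  # perspective divide
        u = np.round(uv[:, 0]).astype(int)
        v = np.round(uv[:, 1]).astype(int)
        # Keep only voxels in front of the camera and inside the image bounds.
        visible = (z[:, 0] > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        accum[visible] += feats[v[visible], u[visible]]
        counts[visible] += 1
    # Voxels seen by no view keep zero features (divide-by-one guard).
    return accum / np.clip(counts, 1, None)
```

Averaging over views is one plausible pooling choice; because the paper stresses robustness to inconsistent diffusion-generated references, the actual method likely uses a more noise-tolerant aggregation than a plain mean.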
Problem

Research questions and friction points this paper is trying to address.

3D scene generation
structural modeling
geometric consistency
high-frequency details
supervision dependency
Innovation

Methods, ideas, or system contributions that make the work stand out.

self-supervised
geometry-constrained diffusion
3D scene generation
voxel-aligned feature aggregation
dual self-supervision
Haozhi Zhu
Nanjing University, China
Miaomiao Zhao
Nanjing University, China
Di-yue Liu
Nanjing University, China
Runze Tian
Nanjing University, China
Yan Zhang
Nanjing University, China
Jie Guo
Nanjing University, China
Fenggen Yu
Applied Scientist at Amazon
Computer Graphics, Computer Vision