ZeroScene: A Zero-Shot Framework for 3D Scene Generation from a Single Image and Controllable Texture Editing

📅 2025-09-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Single-image 3D scene reconstruction struggles to ensure both the quality of individual assets and the coherence of the overall scene, while texture editing often fails to maintain local continuity and multi-view consistency. Method: ZeroScene is a zero-shot framework for single-image 3D scene generation and controllable texture editing that leverages the priors of large vision models. It (1) extracts object-level 2D segmentation and depth from the input image to infer the scene's spatial layout; (2) jointly optimizes 3D and 2D projection losses of the point cloud to update object poses, aligning foreground and background into a coherent, complete scene; and (3) edits textures by constraining the diffusion model and applying mask-guided progressive image generation for multi-view consistency, with physically based rendering (PBR) material estimation for added realism. Contribution/Results: Experiments indicate that the framework preserves the geometric and appearance accuracy of generated assets, faithfully reconstructs scene layouts, and produces highly detailed textures that closely align with text prompts.

📝 Abstract

In the field of 3D content generation, single-image scene reconstruction methods still struggle to simultaneously ensure the quality of individual assets and the coherence of the overall scene in complex environments, while texture editing techniques often fail to maintain both local continuity and multi-view consistency. In this paper, we propose a novel system, ZeroScene, which leverages the prior knowledge of large vision models to accomplish both single-image-to-3D scene reconstruction and texture editing in a zero-shot manner. ZeroScene extracts object-level 2D segmentation and depth information from input images to infer spatial relationships within the scene. It then jointly optimizes 3D and 2D projection losses of the point cloud to update object poses for precise scene alignment, ultimately constructing a coherent and complete 3D scene that encompasses both foreground and background. Moreover, ZeroScene supports texture editing of objects in the scene. By imposing constraints on the diffusion model and introducing a mask-guided progressive image generation strategy, we effectively maintain texture consistency across multiple viewpoints and further enhance the realism of rendered results through Physically Based Rendering (PBR) material estimation. Experimental results demonstrate that our framework not only ensures the geometric and appearance accuracy of generated assets, but also faithfully reconstructs scene layouts and produces highly detailed textures that closely align with text prompts.
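The joint 3D and 2D projection objective described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the loss formulas, so the version below assumes known point correspondences, a pose made of a uniform scale plus translation, a pinhole camera, and a hypothetical weight `w2d`. The names `project` and `joint_loss` and all default values are illustrative only.

```python
import numpy as np

def project(points, focal, cx, cy):
    """Pinhole projection of (N, 3) camera-space points to (N, 2) pixels."""
    return focal * points[:, :2] / points[:, 2:3] + np.array([cx, cy])

def joint_loss(obj_pts, target_pts, target_px, scale, translation,
               focal=500.0, cx=320.0, cy=240.0, w2d=1e-3):
    """Weighted sum of a 3D point loss and a 2D reprojection loss.

    Assumption: row i of obj_pts corresponds to row i of target_pts and
    target_px. A real pipeline would likely need a correspondence-free
    distance instead.
    """
    moved = scale * obj_pts + translation            # candidate object pose
    loss_3d = np.mean(np.sum((moved - target_pts) ** 2, axis=1))
    pixels = project(moved, focal, cx, cy)           # where the object lands in the image
    loss_2d = np.mean(np.sum((pixels - target_px) ** 2, axis=1))
    return loss_3d + w2d * loss_2d

# Toy data: an object point cloud and the pose we pretend depth estimation implies.
rng = np.random.default_rng(0)
obj_pts = rng.uniform(-1.0, 1.0, size=(50, 3))
true_scale, true_t = 1.5, np.array([0.2, -0.1, 4.0])
target_pts = true_scale * obj_pts + true_t
target_px = project(target_pts, 500.0, 320.0, 240.0)

# The loss vanishes at the true pose and grows when the pose is perturbed.
print(joint_loss(obj_pts, target_pts, target_px, true_scale, true_t))
print(joint_loss(obj_pts, target_pts, target_px, true_scale, true_t + 0.1))
```

Combining the two terms means a pose is penalized both when it drifts from the depth-derived 3D points and when its projection drifts from the 2D evidence, so an optimizer minimizing this sum is pulled toward poses that agree with both.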

Problem

Research questions and friction points this paper is trying to address.

Achieving coherent 3D scene reconstruction from single images in complex environments

Maintaining texture consistency across multiple viewpoints during editing

Ensuring both geometric accuracy and realistic appearance in generated assets

Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages large vision models for zero-shot 3D reconstruction

Jointly optimizes 3D and 2D losses for precise scene alignment

Uses diffusion constraints and mask guidance for texture editing
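One plausible reading of mask-guided progressive generation is sketched below. It is an assumption, not the paper's algorithm: views are processed in order, texels already textured by earlier views are kept fixed, and only the still-empty visible texels are generated, conditioned on the kept ones. The function `toy_generate` is a hypothetical stand-in for a constrained diffusion model, and the flat texel array is a simplification of a real texture atlas.

```python
import numpy as np

def progressive_fill(visibility, generate):
    """Fill a texture view by view, never overwriting earlier results.

    visibility: (V, T) boolean array, True where view v sees texel t.
    generate:   callable(texture, known_mask, todo_mask, view_index)
                returning a (T,) array of proposed texel values.
    """
    n_views, n_texels = visibility.shape
    texture = np.zeros(n_texels)
    filled = np.zeros(n_texels, dtype=bool)
    for v in range(n_views):
        todo = visibility[v] & ~filled     # mask of texels to generate now
        known = visibility[v] & filled     # already textured, kept as context
        proposal = generate(texture, known, todo, v)
        texture[todo] = proposal[todo]     # write only inside the mask
        filled |= todo
    return texture, filled

def toy_generate(texture, known, todo, view_index):
    """Hypothetical generator: copy the mean of the known texels, if any."""
    base = texture[known].mean() if known.any() else float(view_index + 1)
    return np.full(texture.shape, base)

# Three overlapping views over four texels.
visibility = np.array([
    [True,  True,  False, False],
    [False, True,  True,  False],
    [False, False, True,  True],
])
texture, filled = progressive_fill(visibility, toy_generate)
print(texture, filled)
```

Because each step writes only inside its `todo` mask and conditions on texels shared with earlier views, overlapping regions cannot be repainted inconsistently, which is the property the multi-view consistency claim relies on.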
