🤖 AI Summary
In single-view indoor scene reconstruction from a 2D image, severe depth ambiguity and heavy instance occlusion cause geometric distortion and incomplete layout estimation. To address these challenges, this paper proposes a two-stage decoupled framework: it first performs amodal instance completion and layout refinement in image space, then generates the 3D geometric structure. Key contributions include: (1) the first diffusion-based modular architecture for indoor scene reconstruction; (2) amodal instance completion tailored to indoor scenes; (3) a dedicated layout-refinement inpainting module; and (4) a hybrid depth-estimation scheme coupled with joint 2D/3D view alignment. Evaluated on the 3D-Front dataset, the method significantly outperforms state-of-the-art approaches in both geometric accuracy (e.g., Chamfer distance, F-Score) and visual realism (e.g., LPIPS, FID). The resulting high-fidelity reconstructions support practical applications in interior design, real estate visualization, and augmented reality.
📝 Abstract
We propose a modular framework for single-view indoor 3D scene reconstruction in which several core modules are powered by diffusion techniques. Traditional approaches to this task often struggle with the complex instance shapes and occlusions inherent in indoor environments: they overreach by attempting to predict 3D shapes directly from incomplete 2D images, which limits reconstruction quality. We overcome this limitation by splitting the process into two steps: we first employ diffusion-based techniques to predict complete views of the room background and of occluded indoor instances, and then lift these completed views into 3D. Our framework contributes the following components: an amodal completion module that restores the full view of occluded instances, an inpainting model specifically trained to predict room layouts, a hybrid depth-estimation technique that balances overall geometric accuracy with fine-detail expressiveness, and a view-space alignment method that exploits both 2D and 3D cues to place instances precisely within the scene. Together, these components reconstruct both foreground instances and the room background from a single image. Extensive experiments on the 3D-Front dataset demonstrate that our method outperforms current state-of-the-art (SOTA) approaches in both visual quality and reconstruction accuracy. The framework holds promise for applications in interior design, real estate, and augmented reality.
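To make the two-stage decomposition concrete, the sketch below shows how the four modules described in the abstract could compose into a pipeline. All function and class names here are illustrative placeholders (with stub logic), not the paper's actual implementation or API:

```python
# Hypothetical sketch of the two-stage pipeline; every name and all
# internal logic are placeholders standing in for the real modules.
from dataclasses import dataclass


@dataclass
class SceneParts:
    completed_instances: list  # amodal (de-occluded) instance views
    layout: str                # inpainted room-background layout image
    depth: list                # per-pixel depth from the hybrid estimator


def amodal_completion(instance_crops):
    # Stage 1a: diffusion-based restoration of occluded instance regions
    # (stub: tag each crop as completed).
    return [crop + "_completed" for crop in instance_crops]


def layout_inpainting(background):
    # Stage 1b: inpainting model trained to predict the instance-free
    # room layout behind the foreground (stub).
    return background + "_layout"


def hybrid_depth(image):
    # Stage 2a: blend a coarse global estimate with a detail-preserving
    # one, mirroring the accuracy/detail trade-off (stubbed values).
    coarse = [1.0] * len(image)
    fine = [0.1] * len(image)
    return [c + f for c, f in zip(coarse, fine)]


def view_space_alignment(parts):
    # Stage 2b: place each lifted instance in the scene using joint
    # 2D/3D cues (stub: mark every completed instance as placed).
    return {inst: "placed" for inst in parts.completed_instances}


def reconstruct(image, instance_crops, background):
    # Stage 1 (completion/refinement) feeds Stage 2 (geometry/alignment).
    parts = SceneParts(
        completed_instances=amodal_completion(instance_crops),
        layout=layout_inpainting(background),
        depth=hybrid_depth(image),
    )
    return view_space_alignment(parts)
```

For example, `reconstruct("img", ["chair", "table"], "wall")` would return a placement for each completed instance, illustrating that geometry generation only ever sees completed views, never the raw occluded input.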