🤖 AI Summary
This work addresses the challenges of multi-view consistency and layout controllability in novel view synthesis (NVS) for 3D indoor scenes. We propose MVRoom, a controllable 3D scene generation method based on a multi-view diffusion model. Conditioned on a coarse 3D layout (e.g., bounding boxes or semantic planes), it uses a two-stage framework: the first stage converts the layout into consistent image-based condition signals, and the second stage performs image-conditioned multi-view generation. Its core innovation is a layout-aware epipolar attention mechanism that enforces cross-view geometric constraints during the diffusion process. MVRoom supports text-driven generation, scenes with varying numbers of objects and levels of complexity, and iterative scene expansion. Quantitative and qualitative evaluations across multiple benchmarks show improvements over state-of-the-art methods in both rendering fidelity and multi-view consistency.
📝 Abstract
We introduce MVRoom, a controllable novel view synthesis (NVS) pipeline for 3D indoor scenes that uses multi-view diffusion conditioned on a coarse 3D layout. MVRoom employs a two-stage design in which the 3D layout is used throughout to enforce multi-view consistency. The first stage employs novel representations to effectively bridge the 3D layout and consistent image-based condition signals for multi-view generation. The second stage performs image-conditioned multi-view generation, incorporating a layout-aware epipolar attention mechanism to enhance multi-view consistency during the diffusion process. Additionally, we introduce an iterative framework that recursively applies MVRoom's multi-view generation to produce 3D scenes with varying numbers of objects and levels of complexity, and that also supports text-to-scene generation. Experimental results demonstrate that our approach achieves high-fidelity and controllable 3D scene generation for NVS, outperforming state-of-the-art baseline methods both quantitatively and qualitatively. Ablation studies further validate the effectiveness of key components within our generation pipeline.
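To make the idea of epipolar attention more concrete, the following is a minimal NumPy sketch of the generic technique: attention from a pixel in one view is restricted to pixels in another view that lie near its epipolar line. It is not MVRoom's implementation. The layout-aware component is not specified in the abstract and is not reproduced here, and all function names, the pinhole camera convention (X2 = R X1 + t), and the `max_dist` threshold are assumptions made for illustration.

```python
import numpy as np


def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v == np.cross(t, v)."""
    tx, ty, tz = t
    return np.array([[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]])


def fundamental_matrix(K1, K2, R, t):
    """F with x2^T F x1 = 0, assuming camera-frame points obey X2 = R @ X1 + t."""
    E = skew(t) @ R  # essential matrix
    return np.linalg.inv(K2).T @ E @ np.linalg.inv(K1)


def pixel_grid(h, w):
    """Homogeneous pixel centers in row-major order, shape (h*w, 3)."""
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    return np.stack([xs.ravel() + 0.5, ys.ravel() + 0.5, np.ones(h * w)], axis=1)


def epipolar_mask(K1, K2, R, t, h, w, max_dist=1.0):
    """Boolean (h*w, h*w) mask. Entry [i, j] is True when pixel j of view 2
    lies within max_dist pixels of the epipolar line of pixel i of view 1."""
    F = fundamental_matrix(K1, K2, R, t)
    pix = pixel_grid(h, w)
    lines = pix @ F.T                      # row i is the line F @ x1_i
    num = np.abs(lines @ pix.T)            # |l_i . x2_j| for all pairs
    den = np.linalg.norm(lines[:, :2], axis=1, keepdims=True)
    dist = num / np.maximum(den, 1e-12)    # point-to-line distance in pixels
    return dist <= max_dist


def masked_attention(q, k, v, mask):
    """Scaled dot-product attention where disallowed (query, key) pairs
    receive zero weight. Queries with no allowed key fall back to all keys."""
    mask = mask | ~mask.any(axis=1, keepdims=True)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ v
```

In this sketch, a purely horizontal camera translation with identical intrinsics yields horizontal epipolar lines, so each query pixel attends only to key pixels in the same image row of the other view. In a diffusion model the mask would typically be applied to feature-map tokens rather than raw pixels, with the intrinsics scaled to the feature resolution; that detail is likewise an assumption and not taken from the source.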