MVRoom: Controllable 3D Indoor Scene Generation with Multi-View Diffusion Models

📅 2025-12-03
🤖 AI Summary
This work addresses multi-view consistency and layout controllability in novel view synthesis (NVS) for 3D indoor scenes. It proposes MVRoom, a controllable 3D scene generation method built on a multi-view diffusion model and conditioned on a coarse 3D layout (e.g., bounding boxes or semantic planes). MVRoom uses a two-stage framework: the first stage converts the 3D layout into consistent image-based condition signals, and the second stage performs image-conditioned multi-view generation. Its core component is a layout-aware epipolar attention mechanism that strengthens multi-view consistency during the diffusion process. An iterative framework applies MVRoom recursively to generate scenes with varying numbers of objects and levels of complexity, which supports text-to-scene generation. Experiments show that MVRoom outperforms state-of-the-art baselines both quantitatively and qualitatively, and ablation studies validate its key components.

📝 Abstract
We introduce MVRoom, a controllable novel view synthesis (NVS) pipeline for 3D indoor scenes that uses multi-view diffusion conditioned on a coarse 3D layout. MVRoom employs a two-stage design in which the 3D layout is used throughout to enforce multi-view consistency. The first stage employs novel representations to effectively bridge the 3D layout and consistent image-based condition signals for multi-view generation. The second stage performs image-conditioned multi-view generation, incorporating a layout-aware epipolar attention mechanism to enhance multi-view consistency during the diffusion process. Additionally, we introduce an iterative framework that generates 3D scenes with varying numbers of objects and scene complexities by recursively performing multi-view generation (MVRoom), supporting text-to-scene generation. Experimental results demonstrate that our approach achieves high-fidelity and controllable 3D scene generation for NVS, outperforming state-of-the-art baseline methods both quantitatively and qualitatively. Ablation studies further validate the effectiveness of key components within our generation pipeline.
Problem

Research questions and friction points this paper is trying to address.

Generating controllable 3D indoor scenes from coarse 3D layouts
Enforcing multi-view consistency in novel view synthesis
Supporting text-to-scene generation through recursive multi-view generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage pipeline using multi-view diffusion models
Layout-aware epipolar attention for multi-view consistency
Iterative framework for recursive text-to-scene generation
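
To make the epipolar idea above concrete, here is a minimal sketch of a generic epipolar attention mask between two views. It is not MVRoom's layout-aware mechanism, whose details are not given on this page. The function names, the pose convention (X_b = R @ X_a + t), and the pixel-distance threshold are all assumptions made for illustration. The mask marks which view-B pixels lie near the epipolar line of each view-A pixel, which is one common way to restrict cross-view attention to geometrically plausible matches.

```python
import numpy as np


def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v equals np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])


def fundamental_matrix(K_a, K_b, R, t):
    """Fundamental matrix F such that F @ x_a is the epipolar line in view B.

    Assumed convention (hypothetical, not from the paper): a 3D point with
    camera-A coordinates X_a has camera-B coordinates X_b = R @ X_a + t.
    """
    E = skew(t) @ R  # essential matrix
    return np.linalg.inv(K_b).T @ E @ np.linalg.inv(K_a)


def epipolar_mask(F, pixels_a, pixels_b, threshold):
    """Boolean mask of shape (Na, Nb).

    Entry (i, j) is True when view-B pixel j lies within `threshold` pixels
    of the epipolar line induced by view-A pixel i.
    """
    xa = np.hstack([pixels_a, np.ones((pixels_a.shape[0], 1))])  # (Na, 3)
    xb = np.hstack([pixels_b, np.ones((pixels_b.shape[0], 1))])  # (Nb, 3)
    lines = xa @ F.T                    # row i is the line F @ x_a[i]
    numer = np.abs(lines @ xb.T)        # |l_i . x_b[j]|, shape (Na, Nb)
    denom = np.linalg.norm(lines[:, :2], axis=1, keepdims=True)
    dist = numer / np.maximum(denom, 1e-12)  # point-to-line distance in pixels
    return dist <= threshold
```

In an attention layer, such a mask would typically be applied by setting the attention logits of masked-out pairs to a large negative value before the softmax. For a pure sideways translation with shared intrinsics, the epipolar lines are horizontal, so the mask keeps only view-B pixels in (nearly) the same image row as the query pixel.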