🤖 AI Summary
Existing text-driven 3D indoor scene synthesis methods lack fine-grained controllability, while graph-driven approaches require labor-intensive manual scene graph construction, hindering both control and usability.
Method: We propose FreeScene, a framework whose vision-language model (VLM)-driven Graph Designer converts free-form text and/or reference images into structured scene graphs, eliminating manual graph annotation. Its generative core, MG-DiT, a Mixed Graph Diffusion Transformer, integrates graph-aware denoising with multimodal conditional modeling to jointly support text-to-scene generation, graph-to-scene synthesis, and scene rearrangement.
Contribution/Results: A single unified model achieves state-of-the-art performance across all three tasks without task-specific architectures or manual graph design. Quantitative and qualitative evaluations demonstrate superior generation quality and more precise structural control than prior methods. By automating scene graph construction and supporting intuitive multimodal inputs, FreeScene significantly lowers the interaction barrier for controllable 3D scene synthesis.
📝 Abstract
Controllability plays a crucial role in the practical applications of 3D indoor scene synthesis. Existing works either allow coarse language-based control, which is convenient but lacks fine-grained scene customization, or employ graph-based control, which offers better controllability but demands considerable expertise for the cumbersome graph design process. To address these challenges, we present FreeScene, a user-friendly framework that enables both convenient and effective control for indoor scene synthesis. Specifically, FreeScene supports free-form user inputs, including text descriptions and/or reference images, allowing users to express versatile design intentions. The user inputs are thoroughly analyzed and integrated into a graph representation by a VLM-based Graph Designer. We then propose MG-DiT, a Mixed Graph Diffusion Transformer, which performs graph-aware denoising to enhance scene generation. MG-DiT not only excels at preserving graph structure but also generalizes across a variety of tasks, including, but not limited to, text-to-scene, graph-to-scene, and rearrangement, all within a single model. Extensive experiments demonstrate that FreeScene provides an efficient and user-friendly solution that unifies text-based and graph-based scene synthesis, outperforming state-of-the-art methods in both generation quality and controllability across a range of applications.