🤖 AI Summary
Existing text-driven 3D indoor scene synthesis methods lack fine-grained controllability, while graph-driven approaches require labor-intensive manual scene graph construction, hindering both control and usability.
Method: We propose FreeScene, a framework whose vision-language model (VLM)-driven Graph Designer converts free-form text and/or reference images into structured scene graphs, eliminating manual graph annotation. Its generative core, MG-DiT, a Mixed Graph Diffusion Transformer, integrates graph-aware denoising with multimodal conditional modeling to jointly support text-to-scene generation, graph-to-scene synthesis, and scene rearrangement.
Contribution/Results: A single unified model achieves state-of-the-art performance across all three tasks without task-specific architectures or manual graph design. Quantitative and qualitative evaluations demonstrate superior generation quality and more precise structural control than prior methods. By automating scene graph construction and supporting intuitive multimodal inputs, FreeScene significantly lowers the interaction barrier for controllable 3D scene synthesis.
📝 Abstract
Controllability plays a crucial role in the practical applications of 3D indoor scene synthesis. Existing works either allow coarse language-based control, which is convenient but lacks fine-grained scene customization, or employ graph-based control, which offers better controllability but demands considerable expertise for the cumbersome graph design process. To address these challenges, we present FreeScene, a user-friendly framework that enables both convenient and effective control for indoor scene synthesis. Specifically, FreeScene supports free-form user inputs, including text descriptions and/or reference images, allowing users to express versatile design intentions. The user inputs are thoroughly analyzed and integrated into a graph representation by a VLM-based Graph Designer. We then propose MG-DiT, a Mixed Graph Diffusion Transformer, which performs graph-aware denoising to enhance scene generation. MG-DiT not only excels at preserving graph structure but also generalizes across a variety of tasks, including, but not limited to, text-to-scene, graph-to-scene, and rearrangement, all within a single model. Extensive experiments demonstrate that FreeScene provides an efficient and user-friendly solution that unifies text-based and graph-based scene synthesis, outperforming state-of-the-art methods in both generation quality and controllability across a range of applications.