🤖 AI Summary
Text-to-3D generation of complex scenes struggles with modeling multi-object interactions, keeping layouts coherent, and preventing appearance leakage across objects. To address these challenges, this paper proposes GraLa3D, a graph-based framework. GraLa3D builds a structured scene graph of single-object nodes and composite super-nodes that explicitly encodes spatial and semantic relationships among objects, and it attaches layout bounding-box constraints to the graph. An LLM derives the scene graph and layout from the text prompt, and both then guide 3D generation. The framework targets a limitation of existing methods based on score distillation sampling (SDS), which have difficulty manipulating multiple objects with specific interactions. Experiments indicate that GraLa3D generates complex multi-object 3D scenes that align closely with the text prompt while respecting the intended layout and reducing appearance leakage among interacting objects.
📝 Abstract
Recent advancements in object-centric text-to-3D generation have shown impressive results. However, generating complex 3D scenes remains an open challenge due to the intricate relations between objects. Moreover, existing methods are largely based on score distillation sampling (SDS), which limits their ability to manipulate multiple objects with specific interactions. To address these critical yet underexplored issues, we present a novel framework, Scene Graph and Layout Guided 3D Scene Generation (GraLa3D). Given a text prompt describing a complex 3D scene, GraLa3D uses an LLM to model the scene as a scene graph with layout bounding-box information. GraLa3D uniquely constructs the scene graph from single-object nodes and composite super-nodes. In addition to constraining 3D generation to the desired layout, a major contribution lies in modeling the interactions between objects in a super-node while alleviating appearance leakage across the objects within such a node. Our experiments confirm that GraLa3D overcomes the above limitations and generates complex 3D scenes closely aligned with text prompts.
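To make the scene graph representation more concrete, the following is a minimal illustrative sketch in Python, not the authors' implementation. It assumes axis-aligned 3D bounding boxes and assumes that a super-node's region can be approximated by the union of its members' boxes; the abstract specifies neither. All names (`BBox`, `ObjectNode`, `SuperNode`, `SceneGraph`, `build_example`) and all coordinates are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class BBox:
    """Axis-aligned 3D box given by its min and max corners (assumed format)."""
    lo: Vec3
    hi: Vec3

    def union(self, other: "BBox") -> "BBox":
        # Smallest axis-aligned box containing both boxes.
        lo = tuple(min(a, b) for a, b in zip(self.lo, other.lo))
        hi = tuple(max(a, b) for a, b in zip(self.hi, other.hi))
        return BBox(lo, hi)


@dataclass
class ObjectNode:
    """Single-object node: one object, its text description, its layout box."""
    name: str
    prompt: str
    box: BBox


@dataclass
class SuperNode:
    """Composite node grouping objects that interact (hypothetical structure)."""
    members: List[str]
    relation: str


@dataclass
class SceneGraph:
    objects: Dict[str, ObjectNode] = field(default_factory=dict)
    super_nodes: List[SuperNode] = field(default_factory=list)

    def add_object(self, node: ObjectNode) -> None:
        self.objects[node.name] = node

    def add_super_node(self, members: List[str], relation: str) -> SuperNode:
        # Every member must already exist as a single-object node.
        missing = [m for m in members if m not in self.objects]
        if missing:
            raise KeyError(f"unknown objects: {missing}")
        sn = SuperNode(list(members), relation)
        self.super_nodes.append(sn)
        return sn

    def super_node_box(self, sn: SuperNode) -> BBox:
        # Assumption: the super-node region is the union of its members' boxes.
        boxes = [self.objects[m].box for m in sn.members]
        out = boxes[0]
        for b in boxes[1:]:
            out = out.union(b)
        return out


def build_example() -> SceneGraph:
    """Hand-written stand-in for what an LLM might return for a prompt."""
    g = SceneGraph()
    g.add_object(ObjectNode("astronaut", "an astronaut",
                            BBox((-0.2, 0.5, -0.2), (0.2, 1.2, 0.2))))
    g.add_object(ObjectNode("horse", "a brown horse",
                            BBox((-0.6, 0.0, -0.3), (0.6, 0.7, 0.3))))
    g.add_object(ObjectNode("tree", "an oak tree",
                            BBox((1.0, 0.0, -0.5), (2.0, 2.0, 0.5))))
    g.add_super_node(["astronaut", "horse"], "riding")
    return g
```

In this sketch the astronaut and the horse share a super-node because they interact, while the tree stays a standalone single-object node. Keeping per-object prompts and boxes separate inside a super-node is one plausible way to retain the information needed to limit appearance leakage between its members.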