🤖 AI Summary
Existing text-to-3D indoor scene generation methods suffer from three key limitations: (1) end-to-end generative models neglect scene graph structure, leading to layout distortions; (2) vision-language model (VLM)-based approaches incur high computational overhead, hindering deployment on resource-constrained devices; and (3) reliance on manually annotated semantic graphs or ground-truth relational labels restricts generalization and interactive modeling capability. This paper proposes the first relation-supervision-free, text-driven 3D scene generation framework. It introduces textual conditioning into equivariant graph neural networks (EGNNs) within a diffusion-based architecture, jointly modeling graph topology, geometric symmetries, and linguistic semantics. The method enables end-to-end text-to-3D layout synthesis without predefined relations or user-supplied graphs. Experiments demonstrate competitive performance against state-of-the-art methods requiring relational supervision, substantial gains over purely generative baselines, and superior scene coherence and spatial plausibility, while maintaining a lightweight design suitable for edge deployment.
📝 Abstract
Methods that synthesize indoor 3D scenes from text prompts have wide-ranging applications in film production, interior design, video games, virtual reality, and synthetic data generation for training embodied agents. Existing approaches typically either train generative models from scratch or leverage vision-language models (VLMs). While VLMs achieve strong performance, particularly for complex or open-ended prompts, smaller task-specific models remain necessary for deployment on resource-constrained devices such as extended reality (XR) glasses or mobile phones. However, many generative approaches that train from scratch overlook the inherent graph structure of indoor scenes, which can limit scene coherence and realism. Conversely, methods that incorporate scene graphs either demand a user-provided semantic graph, which is generally inconvenient and restrictive, or rely on ground-truth relationship annotations, limiting their capacity to capture more varied object interactions. To address these challenges, we introduce GeoSceneGraph, a method that synthesizes 3D scenes from text prompts by leveraging the graph structure and geometric symmetries of 3D scenes, without relying on predefined relationship classes. Despite not using ground-truth relationships, GeoSceneGraph achieves performance comparable to methods that do. Our model is built on equivariant graph neural networks (EGNNs), but existing EGNN approaches are typically limited to low-dimensional conditioning and are not designed to handle complex modalities such as text. We propose a simple and effective strategy for conditioning EGNNs on text features, and we validate our design through ablation studies.
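The abstract states that existing EGNNs handle only low-dimensional conditioning and that the paper adds a text-conditioning strategy, but it does not spell out the mechanism here. As an illustration only, the sketch below shows one common way such conditioning can work: concatenating a global text embedding `c` into each edge-message input of a standard EGNN layer (Satorras-style updates). All names, shapes, and weight initializations are hypothetical and are not GeoSceneGraph's actual architecture. Because `c` enters only through rotation-invariant quantities, the layer's coordinate update stays rotation-equivariant.

```python
import numpy as np

def silu(z):
    """SiLU activation: z * sigmoid(z)."""
    return z / (1.0 + np.exp(-z))

class TextConditionedEGNNLayer:
    """Illustrative E(3)-equivariant GNN layer (hypothetical design,
    not the paper's): a global text embedding c is concatenated to
    every edge-message input. Since c is coordinate-free, messages
    remain rotation-invariant and coordinate updates equivariant."""

    def __init__(self, h_dim, c_dim, hidden=16, seed=0):
        rng = np.random.default_rng(seed)
        s = 0.2
        edge_in = 2 * h_dim + 1 + c_dim  # [h_i, h_j, ||x_i - x_j||^2, c]
        self.We1 = rng.normal(scale=s, size=(edge_in, hidden))
        self.We2 = rng.normal(scale=s, size=(hidden, hidden))
        self.Wx = rng.normal(scale=s, size=(hidden, 1))
        self.Wh1 = rng.normal(scale=s, size=(h_dim + hidden, hidden))
        self.Wh2 = rng.normal(scale=s, size=(hidden, h_dim))
        self.hidden = hidden

    def __call__(self, h, x, c):
        n = h.shape[0]
        m_sum = np.zeros((n, self.hidden))
        dx = np.zeros_like(x)
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                d2 = np.array([np.sum((x[i] - x[j]) ** 2)])  # invariant
                e_in = np.concatenate([h[i], h[j], d2, c])   # text enters here
                m = silu(silu(e_in @ self.We1) @ self.We2)
                m_sum[i] += m
                dx[i] += (x[i] - x[j]) * (m @ self.Wx)       # equivariant direction
        x_out = x + dx / (n - 1)  # coordinate update
        h_out = h + silu(np.concatenate([h, m_sum], axis=1) @ self.Wh1) @ self.Wh2
        return h_out, x_out
```

Rotating the input coordinates rotates the output coordinates and leaves the node features unchanged, which can be verified numerically with a random orthogonal matrix.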