🤖 AI Summary
Current methods for generating interactive 3D scenes from text suffer from limited scene diversity, poor spatial layout fidelity, frequent object interpenetration, and physically infeasible configurations, which restricts their use in gaming, VR, and embodied AI. To address these limitations, the paper proposes Scenethesis, a training-free agentic framework that integrates LLM-based scene planning with vision-guided spatial refinement. The approach combines three core mechanisms: vision-guided layout refinement, physics-aware iterative pose optimization, and a judge module that verifies spatial coherence. Crucially, it requires no fine-tuning of any pre-trained model and generates, end to end, 3D scenes with high diversity, natural spatial coherence, physical stability, and interactive readiness. Extensive experiments show that the method outperforms state-of-the-art approaches in layout plausibility, interpenetration suppression, and commonsense compliance.
📝 Abstract
Synthesizing interactive 3D scenes from text is essential for gaming, virtual reality, and embodied AI. However, existing methods face several challenges. Learning-based approaches depend on small-scale indoor datasets, limiting scene diversity and layout complexity. While large language models (LLMs) can leverage diverse text-domain knowledge, they struggle with spatial realism, often producing unnatural object placements that fail to respect common sense. Our key insight is that visual perception can bridge this gap by providing the realistic spatial guidance that LLMs lack. To this end, we introduce Scenethesis, a training-free agentic framework that integrates LLM-based scene planning with vision-guided layout refinement. Given a text prompt, Scenethesis first employs an LLM to draft a coarse layout. A vision module then refines it by generating a guidance image and extracting scene structure to capture inter-object relations. Next, an optimization module iteratively enforces accurate pose alignment and physical plausibility, preventing artifacts such as object penetration and instability. Finally, a judge module verifies spatial coherence. Comprehensive experiments show that Scenethesis generates diverse, realistic, and physically plausible 3D interactive scenes, making it valuable for virtual content creation, simulation environments, and embodied AI research.
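The draft-refine-verify loop described in the abstract can be sketched in miniature. The code below is an illustrative caricature, not the paper's implementation: `draft_layout` stands in for the LLM planning stage, `refine` for the physics-aware optimization module (here reduced to pushing interpenetrating 2D footprints apart), and `judge` for the spatial-coherence check. All names and the box-based layout representation are assumptions for illustration only.

```python
from dataclasses import dataclass

@dataclass
class Box:
    """Axis-aligned 2D footprint of an object (x, y = center; w, d = size)."""
    name: str
    x: float
    y: float
    w: float
    d: float

def overlap(a: Box, b: Box) -> float:
    """Penetration depth along the axis of least overlap (0 if separated)."""
    ox = (a.w + b.w) / 2 - abs(a.x - b.x)
    oy = (a.d + b.d) / 2 - abs(a.y - b.y)
    return max(0.0, min(ox, oy))

def draft_layout() -> list[Box]:
    """Stand-in for the LLM planning stage: a coarse, possibly colliding draft."""
    return [Box("sofa", 0.0, 0.0, 2.0, 1.0),
            Box("table", 0.5, 0.2, 1.2, 0.8)]

def refine(layout: list[Box], iters: int = 100, step: float = 0.05) -> list[Box]:
    """Stand-in for the optimization module: iteratively push colliding pairs apart."""
    for _ in range(iters):
        moved = False
        for i in range(len(layout)):
            for j in range(i + 1, len(layout)):
                a, b = layout[i], layout[j]
                if overlap(a, b) > 0:
                    direction = 1.0 if a.x >= b.x else -1.0
                    a.x += direction * step
                    b.x -= direction * step
                    moved = True
        if not moved:  # converged: no remaining penetrations
            break
    return layout

def judge(layout: list[Box]) -> bool:
    """Stand-in for the judge module: accept only penetration-free layouts."""
    return all(overlap(layout[i], layout[j]) == 0
               for i in range(len(layout))
               for j in range(i + 1, len(layout)))

layout = refine(draft_layout())
```

The real system operates on full 3D poses with visual guidance and commonsense constraints; this toy version only shows the control flow shared by the four modules.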