🤖 AI Summary
Existing 3D scene generation methods are either manual, and therefore labor-intensive and inefficient, or data-driven, in which case they struggle to ensure semantic plausibility, physical consistency, and real-time editability at the same time. This work proposes a hierarchical generation and editing framework that combines large language models (LLMs) and vision-language models (VLMs). It pioneers the combination of retrieval-augmented generation (RAG) with hierarchical scene representations: RAG enhances semantic coherence, an optimization module enforces physical consistency, and the hierarchical structure enables efficient inference and interactive editing. Experimental results demonstrate that the proposed approach outperforms existing baselines in both diversity and plausibility of generated scenes while significantly accelerating 3D content creation workflows.
📝 Abstract
3D layout generation and editing play a crucial role in Embodied AI and immersive VR interaction. However, manual creation is labor-intensive, while data-driven generation often lacks diversity. The emergence of large models introduces new possibilities for 3D scene synthesis. We present HOG-Layout, which enables text-driven hierarchical scene generation, optimization, and real-time scene editing with large language models (LLMs) and vision-language models (VLMs). HOG-Layout improves the semantic consistency and plausibility of scenes through retrieval-augmented generation (RAG), incorporates an optimization module to enhance physical consistency, and adopts a hierarchical representation that makes inference and optimization more efficient, enabling real-time editing. Experimental results demonstrate that HOG-Layout produces more reasonable environments than existing baselines while supporting fast and intuitive scene editing.