HOG-Layout: Hierarchical 3D Scene Generation, Optimization and Editing via Vision-Language Models

📅 2026-04-12

🤖 AI Summary
Existing 3D scene generation methods are either manual, and therefore labor-intensive, or data-driven, in which case they struggle to ensure semantic plausibility, physical consistency, and real-time editability at the same time. This work proposes a hierarchical generation and editing framework that combines large language models (LLMs) and vision-language models (VLMs). It pairs retrieval-augmented generation (RAG) with a hierarchical scene representation: RAG improves semantic coherence, an optimization module enforces physical consistency, and the hierarchical structure enables efficient inference and interactive editing. Experiments indicate that the approach outperforms existing baselines in the diversity and plausibility of generated scenes while speeding up 3D content creation.

📝 Abstract
3D layout generation and editing play a crucial role in Embodied AI and immersive VR interaction. However, manual creation is tedious, and data-driven generation often lacks diversity. The emergence of large models opens new possibilities for 3D scene synthesis. We present HOG-Layout, which enables text-driven hierarchical scene generation, optimization, and real-time scene editing with large language models (LLMs) and vision-language models (VLMs). HOG-Layout improves the semantic consistency and plausibility of scenes through retrieval-augmented generation (RAG), incorporates an optimization module to improve physical consistency, and adopts a hierarchical representation that speeds up inference and optimization, making real-time editing possible. Experimental results show that HOG-Layout produces more reasonable environments than existing baselines while supporting fast and intuitive scene editing.
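
To make the link between a hierarchical representation and fast editing concrete, here is a minimal Python sketch. It is not the paper's implementation: the `Node` class, the 2D footprint simplification, and the helper names (`world_positions`, `sibling_collisions`, `build_demo_scene`) are all assumptions made for illustration. Each object stores its position relative to its parent, so moving a group is a single update, and an overlap check only needs to compare siblings inside the edited group.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical scene node: a 2D footprint placed relative to its parent."""
    name: str
    offset: tuple[float, float]          # centre position in the parent's frame
    size: tuple[float, float]            # footprint (width, depth)
    children: list[Node] = field(default_factory=list)

    def add(self, child: Node) -> Node:
        self.children.append(child)
        return child


def world_positions(node: Node, origin=(0.0, 0.0)) -> dict:
    """Accumulate parent offsets to get each node's centre in world space."""
    x, y = origin[0] + node.offset[0], origin[1] + node.offset[1]
    positions = {node.name: (x, y)}
    for child in node.children:
        positions.update(world_positions(child, (x, y)))
    return positions


def local_box(node: Node):
    """Axis-aligned box (xmin, ymin, xmax, ymax) in the parent's frame."""
    (cx, cy), (w, d) = node.offset, node.size
    return (cx - w / 2, cy - d / 2, cx + w / 2, cy + d / 2)


def overlaps(a, b) -> bool:
    """True when two boxes share interior area (touching edges do not count)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def sibling_collisions(parent: Node) -> list:
    """Check only the direct children of one group against each other."""
    kids = parent.children
    return [
        (kids[i].name, kids[j].name)
        for i in range(len(kids))
        for j in range(i + 1, len(kids))
        if overlaps(local_box(kids[i]), local_box(kids[j]))
    ]


def build_demo_scene():
    """Assumed toy scene: a room containing one dining group."""
    room = Node("room", (0.0, 0.0), (5.0, 4.0))
    dining = room.add(Node("dining_area", (1.0, 0.5), (2.0, 2.0)))
    dining.add(Node("table", (0.0, 0.0), (1.0, 0.5)))
    dining.add(Node("chair", (0.0, 0.75), (0.5, 0.5)))
    return room, dining


room, dining = build_demo_scene()
dining.offset = (-1.0, 0.5)              # one edit moves the table and chair together
print(world_positions(room)["chair"])
print(sibling_collisions(dining))
```

In this sketch an edit touches one node and a collision check touches one group, which is one plausible reason a hierarchical layout keeps editing cheap. The optimization module described in the abstract would have to do more than detect overlaps, and it is not modeled here.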
Problem

Research questions and friction points this paper is trying to address.

3D layout generation
scene editing
Embodied AI
immersive VR
semantic consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical 3D Scene Generation
Retrieval-Augmented Generation (RAG)
Vision-Language Models
Real-time Scene Editing
Physical Consistency Optimization