HOG-Layout: Hierarchical 3D Scene Generation, Optimization and Editing via Vision-Language Models

📅 2026-04-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing 3D scene generation methods are either manual, and therefore labor-intensive and inefficient, or data-driven, in which case they struggle to ensure semantic plausibility, physical consistency, and real-time editability at the same time. This work proposes a hierarchical generation and editing framework that integrates large language models (LLMs) and vision-language models (VLMs). It combines retrieval-augmented generation (RAG) with a hierarchical scene representation: RAG improves semantic coherence, an optimization module enforces physical consistency, and the hierarchical structure enables efficient inference and interactive editing. Experimental results show that the approach outperforms existing baselines in both the diversity and the plausibility of generated scenes while accelerating 3D content creation workflows.

📝 Abstract

3D layout generation and editing play a crucial role in Embodied AI and immersive VR interaction. However, manual creation requires tedious labor, while data-driven generation often lacks diversity. The emergence of large models introduces new possibilities for 3D scene synthesis. We present HOG-Layout, which enables text-driven hierarchical scene generation, optimization, and real-time scene editing with large language models (LLMs) and vision-language models (VLMs). HOG-Layout improves the semantic consistency and plausibility of scenes through retrieval-augmented generation (RAG), incorporates an optimization module to improve physical consistency, and adopts a hierarchical representation that makes inference and optimization more efficient, achieving real-time editing. Experimental results demonstrate that HOG-Layout produces more reasonable environments than existing baselines, while supporting fast and intuitive scene editing.
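
To make the idea of a hierarchical representation more concrete, here is a minimal Python sketch of why such a structure can make editing cheap. It is an illustration only and not HOG-Layout's actual data model: the `Node` class, the room/zone/object levels, the 2D footprints, and the helper names `world_boxes` and `overlaps` are all assumptions introduced here. Each node stores its position relative to its parent, so moving one zone repositions every object inside it with a single assignment, and a simple axis-aligned overlap test stands in for the kind of check a physical-consistency step might perform.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of a hypothetical scene hierarchy (room -> zone -> object)."""
    name: str
    offset: tuple[float, float]  # position relative to the parent, in metres
    size: tuple[float, float]    # footprint width and depth, in metres
    children: list[Node] = field(default_factory=list)


def world_boxes(node, origin=(0.0, 0.0)):
    """Return {name: (xmin, ymin, xmax, ymax)} for a node and its descendants."""
    x = origin[0] + node.offset[0]
    y = origin[1] + node.offset[1]
    boxes = {node.name: (x, y, x + node.size[0], y + node.size[1])}
    for child in node.children:
        # Children are placed relative to this node's world position.
        boxes.update(world_boxes(child, (x, y)))
    return boxes


def overlaps(a, b):
    """True if two axis-aligned boxes share interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


# Hypothetical example scene: a room with one dining zone holding two objects.
table = Node("table", (0.25, 0.5), (1.5, 0.8))
chair = Node("chair", (0.75, 1.4), (0.5, 0.5))
zone = Node("dining_zone", (1.0, 1.0), (2.0, 2.0), [table, chair])
room = Node("room", (0.0, 0.0), (5.0, 4.0), [zone])

before = world_boxes(room)
collision_before = overlaps(before["table"], before["chair"])

# Editing the zone moves every object inside it; no per-object update is needed.
zone.offset = (2.0, 1.0)
after = world_boxes(room)
collision_after = overlaps(after["table"], after["chair"])
```

In this sketch the edit touches one field on one node, and the table and chair keep their arrangement relative to each other, so a check that passed before the move still passes after it. A real system would work in 3D with rotations and would also need to test the moved zone against its siblings.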

Problem

Research questions and friction points this paper is trying to address.

3D layout generation

scene editing

Embodied AI

immersive VR

semantic consistency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical 3D Scene Generation

Retrieval-Augmented Generation (RAG)

Vision-Language Models