IL3D: A Large-Scale Indoor Layout Dataset for LLM-Driven 3D Scene Generation

📅 2025-10-13
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the scarcity of high-quality, diverse training data for LLM-driven 3D indoor scene generation, this work introduces the first large-scale multimodal dataset tailored to this task, comprising 27,816 photorealistic indoor layouts and 29,215 high-fidelity 3D object assets. Each instance is annotated with fine-grained natural language descriptions and six complementary modalities: point clouds, 3D bounding boxes, multi-view RGB/depth/normal images, and semantic masks. We further provide a unified export interface and an authoritative benchmark for evaluation. Leveraging this dataset, we perform multimodal supervised fine-tuning (SFT) of LLMs, achieving significant improvements over state-of-the-art methods in cross-modal understanding and layout generation, with markedly enhanced generalization. This work establishes a foundational data resource and a scalable technical paradigm for 3D scene generation and embodied intelligence.

๐Ÿ“ Abstract
In this study, we present IL3D, a large-scale dataset meticulously designed for large language model (LLM)-driven 3D scene generation, addressing the pressing demand for diverse, high-quality training data in indoor layout design. Comprising 27,816 indoor layouts across 18 prevalent room types and a library of 29,215 high-fidelity 3D object assets, IL3D is enriched with instance-level natural language annotations to support robust multimodal learning for vision-language tasks. We establish rigorous benchmarks to evaluate LLM-driven scene generation. Experimental results show that supervised fine-tuning (SFT) of LLMs on IL3D significantly improves generalization and surpasses the performance of SFT on other datasets. IL3D offers flexible multimodal data export capabilities, including point clouds, 3D bounding boxes, multiview images, depth maps, normal maps, and semantic masks, enabling seamless adaptation to various visual tasks. As a versatile and robust resource, IL3D significantly advances research in 3D scene generation and embodied intelligence by providing high-fidelity scene data that supports the environment-perception tasks of embodied agents.
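The abstract describes layouts built from instance-level object annotations (category, natural-language description, 3D bounding box) that can be serialized as text for LLM supervised fine-tuning. The sketch below shows one plausible way such a record might be structured and flattened into an SFT prompt; all class and field names here are hypothetical illustrations, not IL3D's actual schema or export API.

```python
# Hypothetical sketch of an IL3D-style layout record. Field names
# (asset_id, bbox_center, yaw, etc.) are assumptions for illustration,
# not the dataset's real schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ObjectInstance:
    asset_id: str             # reference into the 3D asset library
    category: str             # e.g. "sofa"
    description: str          # instance-level natural language annotation
    bbox_center: List[float]  # 3D bounding box center (x, y, z)
    bbox_size: List[float]    # box extents (width, height, depth)
    yaw: float                # rotation about the vertical axis, radians

@dataclass
class SceneLayout:
    scene_id: str
    room_type: str            # one of the 18 room types
    objects: List[ObjectInstance] = field(default_factory=list)

    def to_prompt(self) -> str:
        """Serialize the layout as plain text for LLM fine-tuning."""
        lines = [f"Room type: {self.room_type}"]
        for o in self.objects:
            lines.append(
                f"- {o.category}: {o.description} | "
                f"center={o.bbox_center} size={o.bbox_size} yaw={o.yaw:.2f}"
            )
        return "\n".join(lines)

scene = SceneLayout("demo_001", "living_room")
scene.objects.append(ObjectInstance(
    asset_id="asset_123",
    category="sofa",
    description="a gray three-seat sofa against the wall",
    bbox_center=[1.0, 0.0, 2.5],
    bbox_size=[2.2, 0.9, 1.0],
    yaw=1.57,
))
print(scene.to_prompt())
```

Text serializations like this are a common way to expose spatial layouts to an LLM during SFT; the other modalities (point clouds, multiview images, masks) would be exported separately for vision encoders.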
Problem

Research questions and friction points this paper is trying to address.

Addressing the need for diverse indoor layout training data
Establishing benchmarks for LLM-driven 3D scene generation
Supporting multimodal learning with rich scene annotations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale indoor layout dataset for LLM-driven generation
Enriched with instance-level natural language annotations
Provides multimodal data exports for various visual tasks
Wenxu Zhou
University of Science and Technology of China
Kaixuan Nie
Songying Technology
Hang Du
Songying Technology
Dong Yin
University of Science and Technology of China
Wei Huang
Songying Technology
Siqiang Guo
Songying Technology
Xiaobo Zhang
Songying Technology
Pengbo Hu
University of Science and Technology of China
Embodied Intelligence · Multi-modal Algorithm · Autonomous Agent · Agentic World Building