SpatialLM: Training Large Language Models for Structured Indoor Modeling

📅 2025-06-09

🤖 AI Summary

This work introduces SpatialLM, a multimodal large language model (MLLM) for indoor 3D scene understanding. It takes raw point clouds as input and outputs structured layouts (walls, doors, windows) together with oriented 3D object bounding boxes and their semantic categories. Rather than relying on a task-specific network design, it follows the standard MLLM architecture: a point cloud encoder is combined with an open-source LLM and the model is fine-tuned end to end. Training uses a large-scale synthetic dataset containing the point clouds of 12,328 indoor scenes (54,778 rooms) with ground-truth 3D annotations. On public benchmarks, the model achieves state-of-the-art performance in layout estimation and competitive results in 3D object detection. The results indicate a feasible path toward stronger spatial understanding in LLMs for augmented reality and embodied robotics.
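To make "oriented 3D object bounding box" concrete, here is a minimal sketch of one common parameterization: a center, a size, and a yaw angle about the vertical axis. The class name `OrientedBox`, its fields, and the helper `footprint_corners` are hypothetical and chosen for illustration; the summary does not state which parameterization SpatialLM uses.

```python
import math
from dataclasses import dataclass


@dataclass
class OrientedBox:
    """Hypothetical oriented box: center, size, and yaw about the z axis."""
    label: str
    cx: float
    cy: float
    cz: float
    dx: float  # extent along the box's local x axis
    dy: float  # extent along the box's local y axis
    dz: float  # extent along z (height)
    yaw: float  # rotation about z, in radians


def footprint_corners(box: OrientedBox) -> list[tuple[float, float]]:
    """Return the four floor-plane corners of the box in world coordinates."""
    c, s = math.cos(box.yaw), math.sin(box.yaw)
    hx, hy = box.dx / 2.0, box.dy / 2.0
    corners = []
    for sx, sy in ((1, 1), (-1, 1), (-1, -1), (1, -1)):
        lx, ly = sx * hx, sy * hy  # corner in the box's local frame
        # Rotate by yaw, then translate to the box center.
        corners.append((box.cx + c * lx - s * ly, box.cy + s * lx + c * ly))
    return corners


sofa = OrientedBox("sofa", 0.0, 0.0, 0.4, 2.0, 4.0, 0.8, math.pi / 2)
print(footprint_corners(sofa))
```

An axis-aligned box is the special case `yaw = 0`; the extra angle is what lets a box fit furniture that is not parallel to the room axes.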

📝 Abstract

SpatialLM is a large language model designed to process 3D point cloud data and generate structured 3D scene understanding outputs. These outputs include architectural elements like walls, doors, windows, and oriented object boxes with their semantic categories. Unlike previous methods which exploit task-specific network designs, our model adheres to the standard multimodal LLM architecture and is fine-tuned directly from open-source LLMs. To train SpatialLM, we collect a large-scale, high-quality synthetic dataset consisting of the point clouds of 12,328 indoor scenes (54,778 rooms) with ground-truth 3D annotations, and conduct a careful study on various modeling and training decisions. On public benchmarks, our model gives state-of-the-art performance in layout estimation and competitive results in 3D object detection. With that, we show a feasible path for enhancing the spatial understanding capabilities of modern LLMs for applications in augmented reality, embodied robotics, and more.
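Because the outputs are structured and produced by a language model, it can help to see how such a scene might be written as text and read back. The sketch below is an assumption made for illustration, not the format defined by the paper: the entity classes `Wall`, `Door`, and `Box`, their fields, and the `name=Class(args)` line syntax are all hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass
class Wall:
    ax: float  # start point (x, y) on the floor plane
    ay: float
    bx: float  # end point (x, y) on the floor plane
    by: float
    height: float


@dataclass
class Door:
    wall_id: str  # name of the wall the door is attached to
    cx: float     # door center along the floor plane
    cy: float
    width: float
    height: float


@dataclass
class Box:
    category: str  # semantic class, e.g. "sofa"
    cx: float
    cy: float
    cz: float
    dx: float
    dy: float
    dz: float
    yaw: float     # rotation about the vertical axis, in radians


ENTITY_TYPES = {cls.__name__: cls for cls in (Wall, Door, Box)}


def to_line(name: str, entity) -> str:
    """Serialize one entity as a single line of text (hypothetical syntax)."""
    args = ",".join(str(getattr(entity, f.name)) for f in fields(entity))
    return f"{name}={type(entity).__name__}({args})"


def parse_line(line: str):
    """Parse a line produced by to_line back into (name, entity)."""
    name, call = line.strip().split("=", 1)
    cls_name, arg_str = call.rstrip(")").split("(", 1)
    cls = ENTITY_TYPES[cls_name]
    raw = arg_str.split(",")
    # String fields are kept as-is; every other field is numeric.
    values = [v if f.type is str else float(v) for f, v in zip(fields(cls), raw)]
    return name, cls(*values)


scene = {
    "wall_0": Wall(0.0, 0.0, 4.0, 0.0, 2.8),
    "door_0": Door("wall_0", 1.0, 0.0, 0.9, 2.1),
    "bbox_0": Box("sofa", 2.0, 1.5, 0.4, 2.0, 0.9, 0.8, 1.57),
}
text = "\n".join(to_line(n, e) for n, e in scene.items())
print(text)
```

A window entity would follow the same pattern as `Door`. Keeping one entity per line means a partially generated or malformed output can still be parsed line by line.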

Problem

Research questions and friction points this paper is trying to address.

Process 3D point cloud data for structured scene understanding

Generate architectural elements and semantic object categories

Enhance the spatial understanding of LLMs for augmented reality and embodied robotics

Innovation

Methods, ideas, or system contributions that make the work stand out.

Large language model processes 3D point cloud data

Fine-tuned from open-source LLMs for scene understanding

Trained on a large-scale synthetic dataset (12,328 indoor scenes, 54,778 rooms) and evaluated on public benchmarks
