AI Summary
Large language models (LLMs) struggle to generate semantically coherent and physically plausible 3D layouts from natural language instructions in densely constrained physical environments. To address this, we propose a vision-language model (VLM)-driven differentiable 3D layout representation framework that jointly models semantic alignment and physical feasibility in an end-to-end manner, without handcrafted geometric or physical constraints. Our method integrates the VLM's semantic understanding with differentiable optimization through dual-path collaborative representation generation and a self-consistent spatial decoding mechanism. Furthermore, we fine-tune the VLM on real-world scene data specifically for layout representation, substantially enhancing its spatial reasoning capability. Experiments demonstrate that our approach outperforms both pure LLM-based baselines and conventional constraint-solving methods in physical plausibility and instruction adherence.
Abstract
Spatial reasoning is a fundamental aspect of human cognition, enabling intuitive understanding and manipulation of objects in three-dimensional space. However, Large Language Models (LLMs) struggle even with simple tasks such as arranging 3D assets in space according to open-ended language instructions, particularly in dense and physically constrained environments. We introduce LayoutVLM, a framework and scene layout representation that exploits the semantic knowledge of Vision-Language Models (VLMs) and supports differentiable optimization to ensure physical plausibility. LayoutVLM employs VLMs to generate two mutually reinforcing representations from visually marked images, and a self-consistent decoding process to improve VLMs' spatial planning. Our experiments show that LayoutVLM addresses the limitations of existing LLM-based and constraint-based approaches, producing physically plausible 3D layouts that are better aligned with the semantic intent of the input language instructions. We also demonstrate that fine-tuning VLMs with the proposed scene layout representation, extracted from existing scene datasets, can improve their spatial reasoning performance.
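To make the "differentiable optimization for physical plausibility" idea concrete, here is a minimal sketch in PyTorch. It is an illustrative assumption, not the paper's actual objective: object positions are free parameters, a soft collision penalty stands in for physical-feasibility constraints, and a small quadratic term stands in for a VLM-supplied semantic target (here hard-coded to the origin). All names (`overlap_penalty`, the loss weights, the box sizes) are hypothetical.

```python
import torch

def overlap_penalty(pos, half_sizes):
    """Soft collision loss: penalize pairwise overlap of axis-aligned 2D boxes.

    pos:        (n, 2) box centers (learnable parameters)
    half_sizes: (n, 2) box half-extents (fixed)
    """
    n = pos.shape[0]
    loss = pos.new_zeros(())
    for i in range(n):
        for j in range(i + 1, n):
            # Positive along an axis exactly when the boxes intersect on it.
            gap = (half_sizes[i] + half_sizes[j]) - (pos[i] - pos[j]).abs()
            # Overlap area (product of per-axis penetrations), zero if separated.
            loss = loss + torch.clamp(gap, min=0.0).prod()
    return loss

# Two unit boxes initialized almost on top of each other; the "semantic"
# term keeps them near the origin while the collision term pushes them apart.
pos = torch.tensor([[0.0, 0.0], [0.1, 0.0]], requires_grad=True)
half = torch.full((2, 2), 0.5)
opt = torch.optim.Adam([pos], lr=0.05)

for _ in range(300):
    opt.zero_grad()
    loss = overlap_penalty(pos, half) + 0.01 * pos.pow(2).sum()
    loss.backward()
    opt.step()
```

After optimization, the boxes settle roughly side by side (center distance near the sum of half-extents), showing how gradient descent on a layout representation can resolve collisions end to end. The real LayoutVLM pipeline differs in that the VLM generates the layout representation and relational constraints that this optimization would act on.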