LayoutVLM: Differentiable Optimization of 3D Layout via Vision-Language Models

📅 2024-12-03
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) struggle to generate semantically coherent and physically plausible 3D layouts from natural language instructions in densely constrained physical environments. To address this, we propose a vision-language model (VLM)-driven differentiable 3D layout representation framework that jointly models semantic alignment and physical feasibility in an end-to-end manner, without handcrafted geometric or physical constraints. Our method integrates the VLM's semantic understanding with differentiable optimization through two mutually reinforcing layout representations and a self-consistent spatial decoding mechanism. Furthermore, we fine-tune the VLM on real-world scene data specifically for layout representation, substantially enhancing its spatial reasoning capability. Experiments demonstrate that our approach outperforms both pure LLM-based baselines and conventional constraint-solving methods in both physical plausibility and instruction adherence.

๐Ÿ“ Abstract
Spatial reasoning is a fundamental aspect of human cognition, enabling intuitive understanding and manipulation of objects in three-dimensional space. However, Large Language Models (LLMs) struggle with simple tasks such as arranging 3D assets in space according to open-ended language instructions, particularly in dense and physically constrained environments. We introduce LayoutVLM, a framework and scene layout representation that exploits the semantic knowledge of Vision-Language Models (VLMs) and supports differentiable optimization to ensure physical plausibility. LayoutVLM employs VLMs to generate two mutually reinforcing representations from visually marked images, and a self-consistent decoding process to improve VLMs' spatial planning. Our experiments show that LayoutVLM addresses the limitations of existing LLM and constraint-based approaches, producing physically plausible 3D layouts better aligned with the semantic intent of input language instructions. We also demonstrate that fine-tuning VLMs with the proposed scene layout representation extracted from existing scene datasets can improve their reasoning performance.
Problem

Research questions and friction points this paper is trying to address.

LLMs struggle to arrange 3D assets according to open-ended language instructions, especially in dense, physically constrained environments.
Generated layouts frequently violate physical plausibility or drift from the semantic intent of the instruction.
VLMs' spatial planning is unreliable without an explicit, differentiable optimization step over the layout.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Differentiable optimization for 3D layout
Vision-Language Models enhance spatial reasoning
Self-consistent decoding improves spatial planning
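The central innovation, refining a language-model-proposed layout via a differentiable objective, can be illustrated with a toy sketch. This is a hypothetical example, not LayoutVLM's actual implementation: objects are treated as 2D discs, a squared-hinge penalty on pairwise overlap stands in for the paper's physical-plausibility terms, and plain gradient descent over object positions stands in for the full optimization (which would also include VLM-generated semantic relation terms).

```python
import math

# Toy sketch (assumption, not LayoutVLM's real objective): each object is a
# disc at a 2D position; overlapping discs incur a differentiable squared-hinge
# penalty, minimized by gradient descent to push objects apart.

def pair_loss(p, q, min_dist):
    """Squared hinge penalty: positive only when discs are closer than min_dist."""
    d = math.hypot(p[0] - q[0], p[1] - q[1])
    pen = max(0.0, min_dist - d)
    return pen * pen

def pair_grad(p, q, min_dist):
    """Analytic gradient of pair_loss with respect to p."""
    dx, dy = p[0] - q[0], p[1] - q[1]
    d = math.hypot(dx, dy) or 1e-9  # avoid division by zero for coincident points
    coeff = -2.0 * max(0.0, min_dist - d) / d
    return coeff * dx, coeff * dy

def optimize(positions, min_dist=1.0, lr=0.05, steps=500):
    """Gradient descent on the sum of pairwise overlap penalties."""
    pos = [list(p) for p in positions]
    for _ in range(steps):
        grads = [[0.0, 0.0] for _ in pos]
        for i in range(len(pos)):
            for j in range(len(pos)):
                if i != j:
                    gx, gy = pair_grad(pos[i], pos[j], min_dist)
                    grads[i][0] += gx
                    grads[i][1] += gy
        for i in range(len(pos)):
            pos[i][0] -= lr * grads[i][0]
            pos[i][1] -= lr * grads[i][1]
    return pos
```

After optimization, every pair of objects ends up roughly `min_dist` apart; in the full framework, the loss would additionally encode the spatial relations proposed by the VLM, so that the refined layout stays faithful to the instruction while remaining physically plausible.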