🤖 AI Summary
Existing text-to-3D scene generation methods exhibit notable limitations in modeling complex geometry, performing object transformations, and adhering to fine-grained textual instructions. To address these challenges, we propose Reason-3D, a framework that integrates Large Reasoning Models (LRMs) into the 3D scene generation pipeline. Reason-3D uses multi-dimensional captions covering physical, functional, and contextual attributes to drive precise object retrieval, and combines implicit and explicit layout constraints with collision-aware spatial reasoning for accurate object localization and layout planning. Compared to state-of-the-art approaches, Reason-3D achieves significant improvements on three core metrics: visual fidelity, instruction-following accuracy, and asset retrieval quality. These results validate the efficacy and generalizability of LRMs for structured 3D spatial reasoning, enabling controllable, interpretable, text-driven 3D scene generation.
📝 Abstract
Prompt-driven scene synthesis allows users to generate complete 3D environments from textual descriptions. Current text-to-scene methods often struggle with complex geometries and object transformations, and adhere only weakly to complex instructions. We address these limitations by introducing Reason-3D, a text-to-scene model powered by large reasoning models (LRMs). Reason-3D retrieves objects using captions that cover physical, functional, and contextual attributes, places the selected objects according to implicit and explicit layout constraints, and refines their positions with collision-aware spatial reasoning. Evaluated on instructions ranging from simple to complex indoor configurations, Reason-3D significantly outperforms previous methods in human-rated visual fidelity, adherence to constraints, and asset retrieval quality. Beyond its contribution to text-to-scene generation, our work showcases the advanced spatial reasoning abilities of modern LRMs. We also release our codebase to support further research on object retrieval and placement with LRMs.
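The abstract does not specify how the collision-aware refinement step is implemented; in Reason-3D it is performed via LRM spatial reasoning. As a purely illustrative sketch of the underlying idea, the following hypothetical snippet (all names and the greedy axis-of-least-penetration strategy are assumptions, not the paper's method) resolves overlaps between the 2D floor-plan footprints of placed objects:

```python
from dataclasses import dataclass

@dataclass
class Box:
    """Axis-aligned 2D footprint of a placed object (top-down floor-plan view)."""
    name: str
    x: float  # center x
    y: float  # center y
    w: float  # width  (extent along x)
    d: float  # depth  (extent along y)

def overlap(a: Box, b: Box) -> tuple[float, float]:
    """Penetration depths along x and y between two boxes (0 if disjoint on that axis)."""
    ox = (a.w + b.w) / 2 - abs(a.x - b.x)
    oy = (a.d + b.d) / 2 - abs(a.y - b.y)
    return max(ox, 0.0), max(oy, 0.0)

def resolve_collisions(boxes: list[Box], iters: int = 20) -> list[Box]:
    """Greedily push later-placed boxes apart along the axis of least penetration."""
    for _ in range(iters):
        moved = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                a, b = boxes[i], boxes[j]
                ox, oy = overlap(a, b)
                if ox > 0 and oy > 0:  # boxes actually intersect
                    moved = True
                    if ox < oy:  # smaller push is along x
                        b.x += ox if b.x >= a.x else -ox
                    else:        # smaller (or equal) push is along y
                        b.y += oy if b.y >= a.y else -oy
        if not moved:  # converged: no remaining overlaps
            break
    return boxes
```

For example, a chair initially overlapping a table is nudged just far enough that the two footprints no longer intersect; an LRM-driven refinement could additionally reason about semantic constraints (e.g. keeping the chair facing the table) that this geometric sketch ignores.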