Text-to-Scene with Large Reasoning Models

📅 2025-09-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing text-to-3D scene generation methods exhibit notable limitations in modeling complex geometry, performing object transformations, and adhering to fine-grained textual instructions. To address these challenges, we propose Reason-3D, a novel framework that integrates Large Reasoning Models (LRMs) deeply into the 3D scene generation pipeline. Reason-3D employs multi-dimensional captioning, covering physical, functional, and contextual attributes, to drive precise object retrieval, and combines implicit and explicit layout constraints with collision-aware spatial reasoning for accurate object localization and layout planning. Compared to state-of-the-art approaches, Reason-3D achieves significant improvements on three core metrics: visual fidelity, instruction-following accuracy, and asset retrieval quality. Our results empirically validate the efficacy and generalizability of LRMs for structured 3D spatial reasoning, establishing a new paradigm for controllable, interpretable, text-driven 3D scene generation.
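The retrieval step described above can be sketched as a weighted similarity search over per-attribute captions. The snippet below is a minimal illustration, not the paper's implementation: the `toy_embed` character-frequency embedding and the attribute names (`physical`, `functional`, `contextual`) are stand-ins; a real system would use LRM-generated captions and a learned text encoder.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def toy_embed(text):
    """Toy stand-in for a text encoder: character-frequency vector over a-z."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - 97] += 1
    return vec

def retrieve(query_caps, asset_caps, embed=toy_embed, weights=None):
    """Score each candidate asset by weighted caption similarity across
    attribute dimensions; return the best asset name and all scores."""
    weights = weights or {k: 1.0 for k in query_caps}
    total_w = sum(weights.values())
    scores = {}
    for name, caps in asset_caps.items():
        s = sum(weights[k] * cosine(embed(query_caps[k]), embed(caps[k]))
                for k in query_caps)
        scores[name] = s / total_w
    return max(scores, key=scores.get), scores
```

For example, a query whose captions exactly match an asset's captions scores 1.0 for that asset, so it is retrieved ahead of unrelated candidates.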

📝 Abstract
Prompt-driven scene synthesis allows users to generate complete 3D environments from textual descriptions. Current text-to-scene methods often struggle with complex geometries and object transformations, and tend to show weak adherence to complex instructions. We address these limitations by introducing Reason-3D, a text-to-scene model powered by large reasoning models (LRMs). Reason-3D integrates object retrieval using captions covering physical, functional, and contextual attributes. Reason-3D then places the selected objects based on implicit and explicit layout constraints, and refines their positions with collision-aware spatial reasoning. Evaluated on instructions ranging from simple to complex indoor configurations, Reason-3D significantly outperforms previous methods in human-rated visual fidelity, adherence to constraints, and asset retrieval quality. Beyond its contribution to the field of text-to-scene generation, our work showcases the advanced spatial reasoning abilities of modern LRMs. Additionally, we release the codebase to further the research in object retrieval and placement with LRMs.
Problem

Research questions and friction points this paper is trying to address.

Generating 3D scenes from text with complex geometries
Improving adherence to complex instructions in text-to-scene
Enhancing object retrieval and spatial reasoning in scenes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates object retrieval using multi-attribute captions
Places objects based on implicit and explicit constraints
Refines positions with collision-aware spatial reasoning
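The collision-aware refinement mentioned in the last bullet can be illustrated with a simple floor-plane sketch. This is a hedged approximation, not Reason-3D's actual algorithm: it assumes axis-aligned bounding-box footprints and greedily pushes overlapping objects apart along the axis of least penetration until no overlaps remain.

```python
from dataclasses import dataclass

@dataclass
class Box:
    """Axis-aligned object footprint on the floor plane: center (x, z), size (w, d)."""
    x: float
    z: float
    w: float  # width along x
    d: float  # depth along z

def overlaps(a: Box, b: Box) -> bool:
    """True if the two footprints intersect."""
    return (abs(a.x - b.x) * 2 < a.w + b.w) and (abs(a.z - b.z) * 2 < a.d + b.d)

def resolve_collisions(boxes: list[Box], margin: float = 0.05,
                       max_iters: int = 100) -> list[Box]:
    """Greedily separate overlapping pairs along the shallower penetration axis,
    leaving a small margin between footprints; stop when no pair overlaps."""
    for _ in range(max_iters):
        moved = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                a, b = boxes[i], boxes[j]
                if not overlaps(a, b):
                    continue
                # Penetration depth along each axis.
                px = (a.w + b.w) / 2 - abs(a.x - b.x)
                pz = (a.d + b.d) / 2 - abs(a.z - b.z)
                if px < pz:  # push apart along x
                    shift = (px / 2 + margin) * (1 if b.x >= a.x else -1)
                    a.x -= shift
                    b.x += shift
                else:        # push apart along z
                    shift = (pz / 2 + margin) * (1 if b.z >= a.z else -1)
                    a.z -= shift
                    b.z += shift
                moved = True
        if not moved:
            break
    return boxes
```

A real layout refiner would also respect the explicit layout constraints (e.g. "against the wall", "facing the sofa") while resolving collisions; this sketch handles only the non-overlap condition.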