MANSION: Multi-floor lANguage-to-3D Scene generatIOn for loNg-horizon tasks

📅 2026-03-12
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing embodied AI benchmarks are largely confined to single-floor indoor environments, which limits their ability to support the spatial reasoning required for cross-floor, long-horizon tasks. To address this gap, this work proposes the first language-driven, building-scale multi-floor 3D scene generation framework, which integrates architectural priors, language-to-3D generation, open-vocabulary semantic editing, and navigability constraints to synthesize diverse, vertically structured scenes. Building on this framework, the work introduces MansionWorld, a dataset comprising over 1,000 multi-story buildings. Benchmark evaluations show a significant performance drop for state-of-the-art agents, indicating that the dataset offers a more challenging and more informative testbed for embodied intelligence.

📝 Abstract
Real-world robotic tasks are long-horizon and often span multiple floors, demanding rich spatial reasoning. However, existing embodied benchmarks are largely confined to single-floor in-house environments, failing to reflect the complexity of real-world tasks. We introduce MANSION, the first language-driven framework for generating building-scale, multi-floor 3D environments. Being aware of vertical structural constraints, MANSION generates realistic, navigable whole-building structures with diverse, human-friendly scenes, enabling the development and evaluation of cross-floor long-horizon tasks. Building on this framework, we release MansionWorld, a dataset of over 1,000 diverse buildings ranging from hospitals to offices, alongside a Task-Semantic Scene Editing Agent that customizes these environments using open-vocabulary commands to meet specific user needs. Benchmarking reveals that state-of-the-art agents degrade sharply in our settings, establishing MANSION as a critical testbed for the next generation of spatial reasoning and planning.
Problem

Research questions and friction points this paper is trying to address.

multi-floor
long-horizon tasks
3D scene generation
spatial reasoning
embodied AI
Innovation

Methods, ideas, or system contributions that make the work stand out.

language-to-3D
multi-floor generation
spatial reasoning
embodied AI
scene editing