AI Summary
To address the challenges of labor-intensive manual design, suboptimal random layout generation, and the large semantic gap between high-level tasks and concrete layouts in task-driven desktop scene generation, this paper proposes the first task-oriented 3D desktop scene generation framework. Methodologically, it introduces a novel task paradigm and establishes MesaTask-10K, a large-scale, high-quality dataset; designs a spatial reasoning chain that decomposes generation into three sequential stages (object inference, spatial relation modeling, and scene graph construction); and integrates large language models with Direct Preference Optimization (DPO) to enable end-to-end mapping from natural-language task descriptions to physically plausible 3D layouts. Experimental results demonstrate that the approach significantly outperforms existing baselines in task consistency, spatial plausibility, and cross-task generalization, establishing new state-of-the-art performance on desktop scene synthesis under task constraints.
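To make the DPO component concrete, the sketch below implements the standard DPO objective (Rafailov et al., 2023) on toy scalar log-probabilities. This is an illustrative reconstruction of the general technique named in the summary, not the authors' training code; all variable names and values are hypothetical.

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss: negative log-sigmoid of the scaled difference
    of policy-vs-reference log-ratios for chosen vs rejected samples.

    In the MesaTask setting, the "chosen" sample would be a physically
    plausible, task-aligned layout and the "rejected" one an implausible
    layout (a hypothetical pairing for illustration).
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # log-sigmoid written explicitly for clarity
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# If the policy prefers the chosen layout more strongly than the reference
# model does, the margin is positive and the loss drops below log(2).
print(dpo_loss(-2.0, -5.0, -3.0, -4.0))
```

Minimizing this loss pushes the policy to assign relatively higher likelihood to preferred layouts without an explicit reward model, which is why DPO is a natural fit for preference data over generated scenes.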
Abstract
The ability of robots to interpret human instructions and execute manipulation tasks necessitates the availability of task-relevant tabletop scenes for training. However, traditional methods for creating these scenes rely on time-consuming manual layout design or purely randomized layouts, which are limited in plausibility or in alignment with the tasks. In this paper, we formulate a novel task, namely task-oriented tabletop scene generation, which poses significant challenges due to the substantial gap between high-level task instructions and the tabletop scenes. To support research on such a challenging task, we introduce MesaTask-10K, a large-scale dataset comprising approximately 10,700 synthetic tabletop scenes with manually crafted layouts that ensure realism and intricate inter-object relations. To bridge the gap between tasks and scenes, we propose a Spatial Reasoning Chain that decomposes the generation process into object inference, spatial interrelation reasoning, and scene graph construction for the final 3D layout. We present MesaTask, an LLM-based framework that utilizes this reasoning chain and is further enhanced with DPO algorithms to generate physically plausible tabletop scenes that align well with given task descriptions. Extensive experiments demonstrate the superior performance of MesaTask compared to baselines in generating task-conforming tabletop scenes with realistic layouts. Project page: https://mesatask.github.io/
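The three-stage Spatial Reasoning Chain described above can be sketched as a pipeline of small functions: infer objects from a task, propose pairwise spatial relations, then assemble a scene graph for layout synthesis. This is a minimal toy sketch; the function names, the lookup table standing in for an LLM call, and the relation heuristics are all assumptions for illustration, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class SceneGraph:
    objects: list          # object names on the tabletop
    relations: list        # (subject, relation, object) triples

def infer_objects(task: str) -> list:
    # Stage 1: object inference. A toy lookup stands in for
    # an LLM mapping a task description to relevant objects.
    lookup = {
        "make coffee": ["mug", "coffee maker", "spoon"],
        "write a letter": ["pen", "paper", "envelope"],
    }
    return lookup.get(task, [])

def reason_relations(objects: list) -> list:
    # Stage 2: spatial interrelation reasoning. Here every object
    # rests on the table and neighbors are chained left-to-right.
    relations = [(obj, "on", "table") for obj in objects]
    relations += [(a, "left of", b) for a, b in zip(objects, objects[1:])]
    return relations

def build_scene_graph(task: str) -> SceneGraph:
    # Stage 3: scene graph construction. A downstream layout solver
    # would turn this graph into concrete 3D object poses.
    objects = infer_objects(task)
    return SceneGraph(objects=objects, relations=reason_relations(objects))

graph = build_scene_graph("make coffee")
print(graph.objects)       # ['mug', 'coffee maker', 'spoon']
print(graph.relations[0])  # ('mug', 'on', 'table')
```

Decomposing generation this way keeps each step checkable on its own: object lists can be validated against the task, and relation triples can be validated for physical plausibility before any geometry is placed.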