MesaTask: Towards Task-Driven Tabletop Scene Generation via 3D Spatial Reasoning

๐Ÿ“… 2025-09-26
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
To address the challenges of labor-intensive manual design, suboptimal random layout generation, and significant task-layout semantic gaps in task-driven desktop scene generation, this paper proposes the first task-oriented 3D desktop scene generation framework. Methodologically, it introduces a novel task paradigm and establishes MesaTask-10Kโ€”a large-scale, high-quality dataset; designs a spatial reasoning chain that decomposes generation into three sequential stagesโ€”object inference, spatial relation modeling, and scene graph construction; and integrates large language models with Direct Preference Optimization (DPO) to enable end-to-end mapping from natural-language task descriptions to physically plausible 3D layouts. Experimental results demonstrate that our approach significantly outperforms existing baselines in task consistency, spatial plausibility, and cross-task generalization, establishing new state-of-the-art performance on desktop scene synthesis under task constraints.

Technology Category

Application Category

๐Ÿ“ Abstract
The ability of robots to interpret human instructions and execute manipulation tasks necessitates the availability of task-relevant tabletop scenes for training. However, traditional methods for creating these scenes rely on time-consuming manual layout design or purely randomized layouts, which are limited in terms of plausibility or alignment with the tasks. In this paper, we formulate a novel task, namely task-oriented tabletop scene generation, which poses significant challenges due to the substantial gap between high-level task instructions and the tabletop scenes. To support research on such a challenging task, we introduce MesaTask-10K, a large-scale dataset comprising approximately 10,700 synthetic tabletop scenes with manually crafted layouts that ensure realistic layouts and intricate inter-object relations. To bridge the gap between tasks and scenes, we propose a Spatial Reasoning Chain that decomposes the generation process into object inference, spatial interrelation reasoning, and scene graph construction for the final 3D layout. We present MesaTask, an LLM-based framework that utilizes this reasoning chain and is further enhanced with DPO algorithms to generate physically plausible tabletop scenes that align well with given task descriptions. Exhaustive experiments demonstrate the superior performance of MesaTask compared to baselines in generating task-conforming tabletop scenes with realistic layouts. Project page is at https://mesatask.github.io/
Problem

Research questions and friction points this paper is trying to address.

Generating task-relevant tabletop scenes for robot training
Bridging high-level task instructions with 3D spatial layouts
Creating physically plausible scenes with realistic object relations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spatial Reasoning Chain decomposes generation process
LLM-based framework enhanced with DPO algorithms
Generates physically plausible task-aligned tabletop scenes
๐Ÿ”Ž Similar Papers
No similar papers found.
J
Jinkun Hao
Shanghai Jiao Tong University
N
Naifu Liang
Shanghai AI Laboratory
Z
Zhen Luo
SII, Southern University of Science and Technology
X
Xudong Xu
Shanghai AI Laboratory
W
Weipeng Zhong
Shanghai Jiao Tong University
Ran Yi
Ran Yi
Associate Professor, Shanghai Jiao Tong University
Computer VisionComputer Graphics
Y
Yichen Jin
Peking University
Zhaoyang Lyu
Zhaoyang Lyu
PhD of Information Engineering, The Chinese University of Hong Kong
machine learning
F
Feng Zheng
Southern University of Science and Technology
L
Lizhuang Ma
Shanghai Jiao Tong University
J
Jiangmiao Pang
Shanghai AI Laboratory