AI Summary
To address the challenges of labor-intensive manual design, suboptimal random layout generation, and the large semantic gap between high-level tasks and concrete layouts in task-driven desktop scene generation, this paper proposes the first task-oriented 3D desktop scene generation framework. Methodologically, it introduces a novel task paradigm and establishes MesaTask-10K, a large-scale, high-quality dataset; designs a spatial reasoning chain that decomposes generation into three sequential stages (object inference, spatial relation modeling, and scene graph construction); and integrates large language models with Direct Preference Optimization (DPO) to enable end-to-end mapping from natural-language task descriptions to physically plausible 3D layouts. Experimental results demonstrate that the approach significantly outperforms existing baselines in task consistency, spatial plausibility, and cross-task generalization, establishing new state-of-the-art performance on desktop scene synthesis under task constraints.
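To make the DPO component concrete, the sketch below implements the standard DPO objective (Rafailov et al., 2023) on toy scalar log-probabilities. This is an illustrative reconstruction of the general technique named in the summary, not the authors' training code; all variable names and values are hypothetical.

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss: negative log-sigmoid of the scaled difference
    of policy-vs-reference log-ratios for chosen vs rejected samples.

    In the MesaTask setting, the "chosen" sample would be a physically
    plausible, task-aligned layout and the "rejected" one an implausible
    layout (a hypothetical pairing for illustration).
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # log-sigmoid written explicitly for clarity
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# If the policy prefers the chosen layout more strongly than the reference
# model does, the margin is positive and the loss drops below log(2).
print(dpo_loss(-2.0, -5.0, -3.0, -4.0))
```

Minimizing this loss pushes the policy to assign relatively higher likelihood to preferred layouts without an explicit reward model, which is why DPO is a natural fit for preference data over generated scenes.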
Abstract
The ability of robots to interpret human instructions and execute manipulation tasks necessitates the availability of task-relevant tabletop scenes for training. However, traditional methods for creating these scenes rely on time-consuming manual layout design or purely randomized layouts, which are limited in plausibility or in alignment with the tasks. In this paper, we formulate a novel task, namely task-oriented tabletop scene generation, which poses significant challenges due to the substantial gap between high-level task instructions and the tabletop scenes. To support research on such a challenging task, we introduce MesaTask-10K, a large-scale dataset comprising approximately 10,700 synthetic tabletop scenes with manually crafted layouts that ensure realism and intricate inter-object relations. To bridge the gap between tasks and scenes, we propose a Spatial Reasoning Chain that decomposes the generation process into object inference, spatial interrelation reasoning, and scene graph construction for the final 3D layout. We present MesaTask, an LLM-based framework that utilizes this reasoning chain and is further enhanced with DPO algorithms to generate physically plausible tabletop scenes that align well with given task descriptions. Extensive experiments demonstrate the superior performance of MesaTask compared to baselines in generating task-conforming tabletop scenes with realistic layouts. Project page: https://mesatask.github.io/
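The three-stage Spatial Reasoning Chain described above can be sketched as a pipeline of small functions: infer objects from a task, propose pairwise spatial relations, then assemble a scene graph for layout synthesis. This is a minimal toy sketch; the function names, the lookup table standing in for an LLM call, and the relation heuristics are all assumptions for illustration, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class SceneGraph:
    objects: list          # object names on the tabletop
    relations: list        # (subject, relation, object) triples

def infer_objects(task: str) -> list:
    # Stage 1: object inference. A toy lookup stands in for
    # an LLM mapping a task description to relevant objects.
    lookup = {
        "make coffee": ["mug", "coffee maker", "spoon"],
        "write a letter": ["pen", "paper", "envelope"],
    }
    return lookup.get(task, [])

def reason_relations(objects: list) -> list:
    # Stage 2: spatial interrelation reasoning. Here every object
    # rests on the table and neighbors are chained left-to-right.
    relations = [(obj, "on", "table") for obj in objects]
    relations += [(a, "left of", b) for a, b in zip(objects, objects[1:])]
    return relations

def build_scene_graph(task: str) -> SceneGraph:
    # Stage 3: scene graph construction. A downstream layout solver
    # would turn this graph into concrete 3D object poses.
    objects = infer_objects(task)
    return SceneGraph(objects=objects, relations=reason_relations(objects))

graph = build_scene_graph("make coffee")
print(graph.objects)       # ['mug', 'coffee maker', 'spoon']
print(graph.relations[0])  # ('mug', 'on', 'table')
```

Decomposing generation this way keeps each step checkable on its own: object lists can be validated against the task, and relation triples can be validated for physical plausibility before any geometry is placed.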