TabletopGen: Instance-Level Interactive 3D Tabletop Scene Generation from Text or Single Image

📅 2025-11-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of high-fidelity desktop-scale 3D scene generation for embodied AI—specifically, the inability of existing methods to accurately model dense object layouts and complex spatial relationships from a single image or text prompt. We propose a training-free, two-stage instance-level scene synthesis framework. Methodologically, we introduce a novel decoupled pose-and-scale alignment mechanism: first, coarse layout is established via instance segmentation and completion, followed by canonical coordinate alignment; second, fine-grained refinement is achieved through differentiable rotation optimization under top-down spatial constraints, ensuring geometric consistency and physical interactivity. Experiments demonstrate that our approach significantly outperforms state-of-the-art methods in visual fidelity, layout accuracy, and simulation readiness. We further validate its effectiveness and generalizability in robot manipulation policy learning and synthetic data generation tasks.
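The "differentiable rotation optimization" step described above can be illustrated with a minimal toy sketch: recovering a planar (yaw) rotation by gradient descent on an alignment loss. This is an assumption-laden simplification for intuition only; the paper's Differentiable Rotation Optimizer operates on rendered views of reconstructed 3D instances under top-down spatial constraints, not on bare 2D point correspondences, and all function names here are illustrative.

```python
import numpy as np

def rotate(points, theta):
    """Rotate Nx2 points about the origin by angle theta (radians)."""
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])
    return points @ R.T

def recover_rotation(src, tgt, lr=0.1, steps=200):
    """Gradient-descent recovery of the planar rotation aligning src to tgt.

    Loss L(theta) = mean ||R(theta) x_i - y_i||^2. We use the analytic
    gradient dL/dtheta (via dR/dtheta) to stay dependency-free; an autodiff
    framework would compute the same quantity automatically.
    """
    theta = 0.0
    for _ in range(steps):
        c, s = np.cos(theta), np.sin(theta)
        R = np.array([[c, -s], [s, c]])
        dR = np.array([[-s, -c], [c, -s]])  # elementwise dR/dtheta
        residual = src @ R.T - tgt          # per-point alignment error
        # chain rule: dL/dtheta = 2 * mean_i <dR x_i, residual_i>
        grad = 2.0 * np.mean(np.sum((src @ dR.T) * residual, axis=1))
        theta -= lr * grad
    return theta

# Toy usage: recover a known 40-degree yaw from noiseless correspondences.
rng = np.random.default_rng(0)
src = rng.normal(size=(50, 2))
true_theta = np.deg2rad(40.0)
tgt = rotate(src, true_theta)
est = recover_rotation(src, tgt)
```

Because the loss is smooth in the rotation angle, plain first-order descent converges quickly here; the appeal of making the step differentiable is that it can be composed with rendering-based losses in the full pipeline.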

📝 Abstract
Generating high-fidelity, physically interactive 3D simulated tabletop scenes is essential for embodied AI, especially for robotic manipulation policy learning and data synthesis. However, current text- or image-driven 3D scene generation methods mainly focus on large-scale scenes and struggle to capture the high-density layouts and complex spatial relations that characterize tabletop scenes. To address these challenges, we propose TabletopGen, a training-free, fully automatic framework that generates diverse, instance-level interactive 3D tabletop scenes. TabletopGen accepts a reference image as input, which can be synthesized by a text-to-image model to enhance scene diversity. We then perform instance segmentation and completion on the reference to obtain per-instance images. Each instance is reconstructed into a 3D model and aligned to a canonical coordinate frame. The aligned 3D models then undergo pose and scale estimation before being assembled into a collision-free, simulation-ready tabletop scene. A key component of our framework is a novel pose and scale alignment approach that decouples the complex spatial reasoning into two stages: a Differentiable Rotation Optimizer for precise rotation recovery and a Top-view Spatial Alignment mechanism for robust translation and scale estimation, enabling accurate 3D reconstruction from a 2D reference. Extensive experiments and user studies show that TabletopGen achieves state-of-the-art performance, markedly surpassing existing methods in visual fidelity, layout accuracy, and physical plausibility, and is capable of generating realistic tabletop scenes with rich stylistic and spatial diversity. Our code will be publicly available.
Problem

Research questions and friction points this paper is trying to address.

Generating interactive 3D tabletop scenes for robotic manipulation learning
Overcoming limitations in capturing dense layouts and spatial relations
Creating simulation-ready scenes from text or single image inputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free framework for 3D tabletop scene generation
Differentiable Rotation Optimizer for precise rotation recovery
Top-view Spatial Alignment for translation and scale estimation
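The Top-view Spatial Alignment idea listed above can be sketched as follows: place each reconstructed model so that its top-view footprint matches the corresponding instance footprint from the reference. This is a hedged sketch under assumptions not stated in the summary: the box format, the choice of a single uniform scale, and the function name are all illustrative, not the paper's exact formulation.

```python
import numpy as np

def align_to_footprint(model_xy, ref_box):
    """Scale and translate Nx2 top-view model points into ref_box.

    ref_box = (xmin, ymin, xmax, ymax) in table-plane coordinates.
    A single uniform scale preserves the model's aspect ratio; the
    model is then centered on the reference footprint.
    """
    xmin, ymin, xmax, ymax = ref_box
    mmin, mmax = model_xy.min(axis=0), model_xy.max(axis=0)
    extent = np.maximum(mmax - mmin, 1e-9)  # guard degenerate footprints
    # Largest uniform scale that keeps the footprint inside the box.
    scale = min((xmax - xmin) / extent[0], (ymax - ymin) / extent[1])
    model_center = (mmin + mmax) / 2.0
    box_center = np.array([(xmin + xmax) / 2.0, (ymin + ymax) / 2.0])
    return (model_xy - model_center) * scale + box_center, scale

# Toy usage: fit a unit-square footprint into the box (2, 3, 4, 5).
corners = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
placed, s = align_to_footprint(corners, (2.0, 3.0, 4.0, 5.0))
```

Solving translation and scale this way in the 2D top view sidesteps full 6-DoF reasoning once rotation has been recovered separately, which is the point of decoupling the two stages.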
Ziqian Wang
School of Artificial Intelligence, University of Chinese Academy of Sciences
Yonghao He
D-Robotics
Licheng Yang
School of Artificial Intelligence, University of Chinese Academy of Sciences
Wei Zou
PKU, Samsung, Baidu, Didi, Ke
Speech · NLP · LLM · Multimodal
Hongxuan Ma
State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences
Liu Liu
D-Robotics
Wei Sui
Horizon Robotics
3D Vision · BEV Perception · 3D Reconstruction
Yuxin Guo
School of Artificial Intelligence, University of Chinese Academy of Sciences
Hu Su
State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences