GENMANIP: LLM-driven Simulation for Generalizable Instruction-Following Manipulation

πŸ“… 2025-06-12
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Real-world robotic manipulation still generalizes poorly across varying instructions and scene configurations, while existing simulation platforms lack the fidelity and standardization required for fair evaluation of foundation models (e.g., LLMs). To address this, we propose GenManip, the first photorealistic tabletop manipulation simulation platform explicitly designed for rigorous generalization benchmarking. Its core contributions include: (1) a novel LLM-driven framework for automatic synthesis of task-oriented scene graphs; (2) GenManip-Bench, a benchmark of 200 precisely annotated scenes refined with human-in-the-loop corrections; and (3) a modular perception-reasoning-planning architecture. Experiments demonstrate that the modular system significantly outperforms end-to-end approaches on unseen instruction-scene combinations, achieving substantial gains in zero-shot generalization. We open-source 10K annotated 3D object assets and an extensible training pipeline, establishing a new paradigm for evaluating generalization in embodied intelligence.

πŸ“ Abstract
Robotic manipulation in real-world settings remains challenging, especially regarding robust generalization. Existing simulation platforms lack sufficient support for exploring how policies adapt to varied instructions and scenarios. Thus, they lag behind the growing interest in instruction-following foundation models like LLMs, whose adaptability is crucial yet remains underexplored in fair comparisons. To bridge this gap, we introduce GenManip, a realistic tabletop simulation platform tailored for policy generalization studies. It features an automatic pipeline via LLM-driven task-oriented scene graph to synthesize large-scale, diverse tasks using 10K annotated 3D object assets. To systematically assess generalization, we present GenManip-Bench, a benchmark of 200 scenarios refined via human-in-the-loop corrections. We evaluate two policy types: (1) modular manipulation systems integrating foundation models for perception, reasoning, and planning, and (2) end-to-end policies trained through scalable data collection. Results show that while data scaling benefits end-to-end methods, modular systems enhanced with foundation models generalize more effectively across diverse scenarios. We anticipate this platform to facilitate critical insights for advancing policy generalization in realistic conditions. Project Page: https://genmanip.axi404.top/.
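The abstract contrasts two policy types: end-to-end policies trained on collected data, and modular systems that chain foundation models for perception, reasoning, and planning. The composition idea behind the second type can be sketched as follows; all function names and the toy stage implementations are illustrative assumptions, not GenManip's actual API.

```python
# Minimal sketch of a modular perception-reasoning-planning policy:
# three separately swappable stages composed behind one policy interface.
# Stage names and signatures are hypothetical, for illustration only.
from typing import Callable

def modular_policy(perceive: Callable, reason: Callable, plan: Callable):
    """Compose foundation-model-backed stages into a single policy."""
    def policy(observation, instruction):
        scene = perceive(observation)          # e.g. detect objects and poses
        subgoal = reason(scene, instruction)   # e.g. an LLM picks the next subgoal
        return plan(scene, subgoal)            # e.g. a motion planner emits an action
    return policy

# Toy stand-ins showing the data flow between stages:
policy = modular_policy(
    perceive=lambda obs: {"objects": obs},
    reason=lambda scene, instr: ("pick", scene["objects"][0]),
    plan=lambda scene, subgoal: {"action": subgoal[0], "target": subgoal[1]},
)
action = policy(["red_mug", "tray"], "put the red mug on the tray")
# action == {"action": "pick", "target": "red_mug"}
```

Because each stage sits behind its own interface, a stronger perception or reasoning model can be dropped in without retraining the rest, which is one plausible reading of why such systems generalize better across unseen instruction-scene combinations.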
Problem

Research questions and friction points this paper is trying to address.

Robotic manipulation lacks robust generalization in real-world settings
Existing platforms fail to support policy adaptation to varied instructions
Instruction-following foundation models' adaptability remains underexplored
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-driven task-oriented scene graph
Large-scale diverse task synthesis
Human-in-the-loop benchmark refinement
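To make the first bullet concrete, a task-oriented scene graph can be thought of as objects plus initial-state and goal-state relations tied to a natural-language instruction; an LLM-driven pipeline would then populate such a structure from an asset library. This is a hedged sketch of one plausible representation, with all class names, fields, and asset identifiers invented for illustration, not taken from the paper.

```python
# Illustrative data structure for a task-oriented scene graph.
# Names and fields are hypothetical; asset IDs are made up.
from dataclasses import dataclass, field

@dataclass
class ObjectNode:
    name: str                       # e.g. "red_mug"
    asset_id: str                   # identifier into an annotated 3D asset library
    pose: tuple = (0.0, 0.0, 0.0)   # tabletop placement (x, y, yaw)

@dataclass
class Relation:
    subject: str    # object name
    predicate: str  # spatial/semantic relation, e.g. "on", "left_of", "inside"
    obj: str        # object name

@dataclass
class TaskSceneGraph:
    instruction: str                                # natural-language task
    objects: dict = field(default_factory=dict)     # name -> ObjectNode
    relations: list = field(default_factory=list)   # initial-state relations
    goal: list = field(default_factory=list)        # goal-state relations

    def add_object(self, node: ObjectNode):
        self.objects[node.name] = node

# Example task: "Put the red mug on the wooden tray"
g = TaskSceneGraph(instruction="Put the red mug on the wooden tray")
g.add_object(ObjectNode("red_mug", "asset_0421"))
g.add_object(ObjectNode("wooden_tray", "asset_1083"))
g.relations.append(Relation("red_mug", "on", "table"))
g.goal.append(Relation("red_mug", "on", "wooden_tray"))
```

Under this reading, large-scale task synthesis amounts to sampling objects from the 10K annotated assets and having the LLM propose instructions with matching initial and goal relation sets, which human annotators then correct for the benchmark.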
πŸ”Ž Similar Papers
No similar papers found.
👥 Authors
Ning Gao (Shanghai AI Laboratory, Xi'an Jiaotong University)
Yilun Chen (Shanghai AI Laboratory)
Shuai Yang (Shanghai AI Laboratory, Zhejiang University)
Xinyi Chen (Shanghai AI Laboratory, Nanjing University)
Yang Tian (Shanghai AI Laboratory)
Hao Li (Shanghai AI Laboratory)
Haifeng Huang (Iowa State University)
Hanqing Wang (Shanghai AI Laboratory)
Tai Wang (Shanghai AI Laboratory)
Jiangmiao Pang (Shanghai AI Laboratory)