Closed-Loop Open-Vocabulary Mobile Manipulation with GPT-4V

📅 2024-04-16
🏛️ arXiv.org
📈 Citations: 18
Influential: 1
📄 PDF
🤖 AI Summary
To enable autonomous robots in open environments to reason dynamically and replan via closed-loop feedback, this paper introduces COME-robot, the first vision-language-driven closed-loop mobile manipulation system built on GPT-4V. Methodologically, it combines a multi-level open-vocabulary perception and situated reasoning module with an iterative closed-loop feedback, failure-diagnosis, and recovery mechanism, unifying 3D environment perception, embodied reasoning, execution monitoring, and dynamic replanning. The core contribution is embedding large language models' open-ended semantic understanding directly into the physical closed-loop control pipeline, enabling commonsense-guided exploration and robust failure recovery. Evaluated on eight real-world mobile and tabletop manipulation tasks, the system improves task success rate by ~35% over state-of-the-art approaches while demonstrating strong free-form instruction parsing and long-horizon task planning.

📝 Abstract
Autonomous robot navigation and manipulation in open environments require reasoning and replanning with closed-loop feedback. In this work, we present COME-robot, the first closed-loop robotic system utilizing the GPT-4V vision-language foundation model for open-ended reasoning and adaptive planning in real-world scenarios. COME-robot incorporates two key innovative modules: (i) a multi-level open-vocabulary perception and situated reasoning module that enables effective exploration of the 3D environment and target object identification using commonsense knowledge and situated information, and (ii) an iterative closed-loop feedback and restoration mechanism that verifies task feasibility, monitors execution success, and traces failure causes across different modules for robust failure recovery. Through comprehensive experiments involving 8 challenging real-world mobile and tabletop manipulation tasks, COME-robot demonstrates a significant improvement in task success rate (~35%) compared to state-of-the-art methods. We further conduct comprehensive analyses to elucidate how COME-robot's design facilitates failure recovery, free-form instruction following, and long-horizon task planning.
Problem

Research questions and friction points this paper is trying to address.

Autonomous robot navigation and manipulation in open environments.
Closed-loop feedback for reasoning and adaptive planning.
Improving task success rate with robust failure recovery.
Innovation

Methods, ideas, or system contributions that make the work stand out.

GPT-4V vision-language model for open-ended reasoning
Multi-level open-vocabulary perception and situated reasoning
Iterative closed-loop feedback for robust failure recovery
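The iterative closed-loop feedback described above can be sketched as an execute-monitor-diagnose-replan cycle. This is a minimal illustration, not the actual COME-robot implementation: all function names (`perceive`, `plan`, `execute_action`, `check_success`, `diagnose`, `replan`) are hypothetical stand-ins for the paper's perception, reasoning, and recovery modules.

```python
def closed_loop_execute(instruction, perceive, plan, execute_action,
                        check_success, diagnose, replan, max_retries=3):
    """Run a plan step by step; on failure, diagnose the cause and replan.

    Hypothetical sketch of a closed-loop control cycle: the module
    callables are placeholders, not COME-robot's real API.
    """
    scene = perceive()                  # open-vocabulary 3D perception
    actions = plan(instruction, scene)  # situated reasoning -> action list
    retries = 0
    while actions:
        action = actions.pop(0)
        execute_action(action)
        scene = perceive()              # re-observe after every action
        if check_success(action, scene):
            retries = 0                 # step verified; move on
            continue
        if retries >= max_retries:
            return False                # give up after repeated failures
        cause = diagnose(action, scene)           # trace failure to a module
        actions = replan(instruction, scene, cause)  # recovery plan
        retries += 1
    return True
```

The key design point, mirroring the abstract, is that perception runs after every action so the planner always reasons over the current scene, and failures feed a root-cause diagnosis back into replanning rather than aborting the task.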
Peiyuan Zhi
Unknown affiliation
Zhiyuan Zhang
State Key Laboratory of General Artificial Intelligence, Beijing Institute for General Artificial Intelligence (BIGAI), Department of Automation, Tsinghua University
Muzhi Han
University of California, Los Angeles
Zeyu Zhang
State Key Laboratory of General Artificial Intelligence, Beijing Institute for General Artificial Intelligence (BIGAI)
Zhitian Li
State Key Laboratory of General Artificial Intelligence, Beijing Institute for General Artificial Intelligence (BIGAI)
Ziyuan Jiao
UCLA
Robotics · Task and Motion Planning · Mobile Manipulation · Robotic Manipulation
Baoxiong Jia
Ph.D. in Computer Science, UCLA
Computer Vision · Artificial Intelligence
Siyuan Huang
State Key Laboratory of General Artificial Intelligence, Beijing Institute for General Artificial Intelligence (BIGAI)