Closed-Loop Open-Vocabulary Mobile Manipulation with GPT-4V

📅 2024-04-16

🏛️ arXiv.org

📈 Citations: 18

✨ Influential: 1

🤖 AI Summary

This paper targets autonomous robots in open environments that must reason and replan from closed-loop feedback. It introduces COME-robot, which the authors describe as the first closed-loop robotic system to use the GPT-4V vision-language model for open-ended reasoning and adaptive planning. The system combines a multi-level open-vocabulary perception and situated reasoning module with an iterative closed-loop feedback and restoration mechanism that verifies task feasibility, monitors execution, and traces failure causes across modules, unifying 3D environment perception, embodied reasoning, execution monitoring, and dynamic replanning. The core contribution is embedding the vision-language model's open-ended semantic understanding in the physical control loop, which enables commonsense-guided exploration and robust failure recovery. Evaluated on eight real-world mobile and tabletop manipulation tasks, the system improves task success rate by about 35% over state-of-the-art methods and shows strong free-form instruction following and long-horizon task planning.

📝 Abstract

Autonomous robot navigation and manipulation in open environments require reasoning and replanning with closed-loop feedback. In this work, we present COME-robot, the first closed-loop robotic system utilizing the GPT-4V vision-language foundation model for open-ended reasoning and adaptive planning in real-world scenarios. COME-robot incorporates two key innovative modules: (i) a multi-level open-vocabulary perception and situated reasoning module that enables effective exploration of the 3D environment and target object identification using commonsense knowledge and situated information, and (ii) an iterative closed-loop feedback and restoration mechanism that verifies task feasibility, monitors execution success, and traces failure causes across different modules for robust failure recovery. Through comprehensive experiments involving 8 challenging real-world mobile and tabletop manipulation tasks, COME-robot demonstrates a significant improvement in task success rate (~35%) compared to state-of-the-art methods. We further conduct comprehensive analyses to elucidate how COME-robot's design facilitates failure recovery, free-form instruction following, and long-horizon task planning.

Problem

Research questions and friction points this paper is trying to address.

Autonomous robot navigation and manipulation in open environments.

Closed-loop feedback for reasoning and adaptive planning.

Improving task success rate with robust failure recovery.

Innovation

Methods, ideas, or system contributions that make the work stand out.

GPT-4V vision-language model for open-ended reasoning

Multi-level open-vocabulary perception and situated reasoning

Iterative closed-loop feedback for robust failure recovery
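
The closed-loop idea listed above can be illustrated with a minimal sketch. This is not COME-robot's implementation: the names `ClosedLoopRunner`, `StepResult`, `plan`, and `execute` are hypothetical, and the planner, which in the paper is driven by GPT-4V, is reduced here to a plain callable. The sketch only shows the general pattern of executing a plan step by step, recording the cause of any failure, and replanning with that feedback.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class StepResult:
    """Outcome of one executed step (hypothetical structure)."""
    success: bool
    failure_cause: Optional[str] = None  # e.g. "perception", "grasp"


@dataclass
class ClosedLoopRunner:
    # Assumed interfaces: plan(instruction, history) returns a list of step
    # names; execute(step) runs one step and reports success or failure.
    plan: Callable[[str, List[str]], List[str]]
    execute: Callable[[str], StepResult]
    max_replans: int = 3
    history: List[str] = field(default_factory=list)

    def run(self, instruction: str) -> bool:
        """Plan, execute, and replan on failure until success or budget ends."""
        for _ in range(self.max_replans + 1):
            # The planner sees the feedback history, so it can adapt.
            steps = self.plan(instruction, self.history)
            failed = False
            for step in steps:
                result = self.execute(step)
                if result.success:
                    self.history.append(f"ok: {step}")
                else:
                    # Record the traced cause and stop this attempt.
                    self.history.append(
                        f"failed: {step} ({result.failure_cause})"
                    )
                    failed = True
                    break
            if not failed:
                return True
        return False
```

In a real system the `execute` callable would wrap perception, navigation, and grasping skills, and the failure cause would come from checking the scene after each step. Here both are left as stubs to be supplied by the caller.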
