🤖 AI Summary
This work addresses ambiguous multi-object arrangement tasks in mobile manipulation (MoMa), such as “setting a dinner table,” where target configurations are underspecified.
Method: We propose a Task and Motion Planning (TAMP) framework integrating Large Language Model (LLM)-driven commonsense reasoning with vision-guided base pose optimization. The LLM performs high-level task decomposition and generates semantically plausible object poses grounded in everyday knowledge; a vision module learns optimal robot base poses to ensure reachability and collision avoidance; and task planning and motion planning are interleaved during execution to guarantee action feasibility.
Contribution/Results: To our knowledge, this is the first TAMP approach to explicitly embed LLM-derived semantic commonsense into the planning pipeline and realize a closed-loop synergy among vision, language, and motion. Evaluated on long-horizon object rearrangement in both simulation and real-world settings, our method achieves an 84.4% success rate in physical experiments. User studies indicate performance approaching, though still below, that of experienced human servers, while the method significantly improves generalization to unspecified target configurations.
📝 Abstract
Task planning and motion planning are two of the most important problems in robotics, where task planning methods help robots achieve high-level goals and motion planning methods maintain low-level feasibility. Task and motion planning (TAMP) methods interleave the two processes of task planning and motion planning to ensure goal achievement and motion feasibility. Within the TAMP context, we are concerned with the mobile manipulation (MoMa) of multiple objects, where it is necessary to interleave actions for navigation and manipulation. In particular, we aim to compute where and how each object should be placed given underspecified goals, such as “set up dinner table with a fork, knife and plate.” We leverage the rich common sense knowledge from large language models (LLMs), for example, about how tableware is organized, to facilitate both task-level and motion-level planning. In addition, we use computer vision methods to learn a strategy for selecting base positions to facilitate MoMa behaviors, where the base position corresponds to the robot’s “footprint” and orientation in its operating space. Altogether, this article provides a principled TAMP framework for MoMa tasks that accounts for common sense about object rearrangement and is adaptive to novel situations involving many objects that need to be moved. We performed quantitative experiments in both real-world settings and simulated environments, evaluating the success rate and efficiency of completing long-horizon object rearrangement tasks. While the robot completed 84.4% of real-world object rearrangement trials, subjective human evaluations indicated that the robot’s performance is still lower than that of experienced human waiters.
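The interleaved structure described above, in which an LLM proposes commonsense object placements, a learned vision module selects a feasible base pose for each placement, and motion-level feasibility gates every task-level step, can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the fixed tableware layout, the distance-based reachability check, and all function names are hypothetical stand-ins for the LLM and vision components.

```python
import math
from dataclasses import dataclass


@dataclass
class Placement:
    """A target pose for one object on the table surface (meters)."""
    name: str
    x: float
    y: float


def llm_propose_placements(goal, objects):
    """Stand-in for the LLM commonsense module: return semantically
    plausible target poses (here, a hard-coded place setting)."""
    layout = {"plate": (0.0, 0.0), "fork": (-0.15, 0.0), "knife": (0.15, 0.0)}
    return [Placement(o, *layout.get(o, (0.0, 0.1))) for o in objects]


def select_base_pose(target, candidates, reach=0.8):
    """Stand-in for the learned base-pose selector: among candidate
    base positions, pick the closest one from which the target
    placement is within the arm's reach; None means infeasible."""
    feasible = [c for c in candidates
                if math.dist(c, (target.x, target.y)) <= reach]
    if not feasible:
        return None
    return min(feasible, key=lambda c: math.dist(c, (target.x, target.y)))


def interleaved_tamp(goal, objects, base_candidates):
    """Interleave task-level steps (which object to place next) with
    motion-level checks (is there a reachable base pose). A motion
    failure aborts the plan, signaling the task level to replan."""
    plan = []
    for placement in llm_propose_placements(goal, objects):
        base = select_base_pose(placement, base_candidates)
        if base is None:
            return None  # motion infeasible: report back to task level
        plan.append(("navigate_to", base))
        plan.append(("place", placement.name, (placement.x, placement.y)))
    return plan
```

Running `interleaved_tamp("set dinner table", ["plate", "fork", "knife"], [(0.0, -0.5), (0.5, 0.5)])` yields an alternating navigate/place sequence; swapping in a real LLM query and a learned base-pose model preserves the same control flow.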