Open-World Task and Motion Planning via Vision-Language Model Inferred Constraints

📅 2024-11-13
🏛️ arXiv.org
📈 Citations: 3
Influential: 0
🤖 AI Summary
Traditional task and motion planning (TAMP) systems struggle to directly parse and execute natural-language goals in open-world robotic manipulation. Method: We propose an end-to-end approach that integrates vision-language models (VLMs) into the TAMP framework. Our method uses VLMs to jointly interpret language instructions and visual observations, generating language-parameterized plans encoded as discrete temporal-logic formulas coupled with continuous geometric constraints, enabling zero-shot task generalization. We further introduce VLM partial planning and continuous constraint interpretation to close the loop from natural language to high-fidelity robot motion. Results: In simulation and on a real robot, our method executes diverse unseen language instructions without any task-specific training, significantly improving the robustness and deployability of language-driven manipulation in open-world settings.
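The pipeline the summary describes can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the VLM call is stubbed, and names such as `query_vlm`, `tamp_search`, and the `Constraint` fields are assumptions introduced here for illustration.

```python
from dataclasses import dataclass

@dataclass
class Constraint:
    kind: str        # VLM-inferred relation, e.g. "on" or "near" (illustrative)
    args: tuple      # object names the relation holds between
    threshold: float # geometric tolerance in meters (assumed representation)

def query_vlm(goal: str, scene: str):
    """Stub for a VLM that infers a partial skill sequence and constraints.

    A real system would prompt a VLM with the language goal and an image;
    here we return a fixed answer so the sketch is runnable.
    """
    plan_sketch = ["pick(cup)", "place(cup, saucer)"]
    constraints = [Constraint("on", ("cup", "saucer"), 0.02)]
    return plan_sketch, constraints

def tamp_search(plan_sketch, constraints, available_skills):
    """Discrete search restricted to skills consistent with the VLM sketch.

    A real TAMP planner would interleave additional skills and sample
    continuous parameters; the constraint check here is a placeholder.
    """
    plan = [s for s in plan_sketch if s in available_skills]
    feasible = all(c.threshold > 0 for c in constraints)  # placeholder check
    return plan, feasible

skills = {"pick(cup)", "place(cup, saucer)", "push(box)"}
sketch, cons = query_vlm("put the cup on the saucer", "tabletop scene")
plan, ok = tamp_search(sketch, cons, skills)
print(plan, ok)  # → ['pick(cup)', 'place(cup, saucer)'] True
```

The key design point mirrored here is the division of labor: the VLM supplies open-world semantics (which skills, which relations), while the TAMP layer retains responsibility for geometric and kinematic feasibility.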

📝 Abstract
Foundation models trained on internet-scale data, such as Vision-Language Models (VLMs), excel at performing tasks involving common sense, such as visual question answering. Despite their impressive capabilities, these models cannot currently be directly applied to challenging robot manipulation problems that require complex and precise continuous reasoning. Task and Motion Planning (TAMP) systems can control high-dimensional continuous systems over long horizons by combining traditional primitive robot operations. However, these systems require a detailed model of how the robot can impact its environment, preventing them from directly interpreting and addressing novel human objectives, for example, an arbitrary natural language goal. We propose deploying VLMs within TAMP systems by having them generate discrete and continuous language-parameterized constraints that enable TAMP to reason about open-world concepts. Specifically, we propose algorithms for VLM partial planning that constrain a TAMP system's discrete temporal search and VLM continuous constraint interpretation to augment the traditional manipulation constraints that TAMP systems seek to satisfy. We demonstrate our approach on two robot embodiments, including a real world robot, across several manipulation tasks, where the desired objectives are conveyed solely through language.
Problem

Research questions and friction points this paper is trying to address.

Integrate Vision-Language Models into Task and Motion Planning for open-world manipulation
Enable robots to interpret and address novel natural language goals
Generate discrete and continuous constraints for long-horizon manipulation tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

VLMs generate discrete and continuous language-parameterized constraints for TAMP systems
VLM partial planning prunes the TAMP system's discrete temporal search
VLM-interpreted continuous constraints augment the standard manipulation constraints TAMP seeks to satisfy
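To make the third point concrete, a VLM-proposed relation such as on(cup, saucer) can be grounded as a geometric test over candidate placements and checked alongside a TAMP system's standard feasibility constraints (collision, kinematics). The pose format, tolerances, and function name below are assumptions for illustration, not details from the paper.

```python
import math

def on_constraint(pose_a, pose_b, xy_tol=0.03, z_gap=(0.0, 0.05)):
    """True if object a plausibly rests on object b.

    Poses are (x, y, z) positions in meters (assumed representation):
    a must be xy-aligned with b within xy_tol, and sit a small positive
    height dz above it, within the z_gap interval.
    """
    dx = pose_a[0] - pose_b[0]
    dy = pose_a[1] - pose_b[1]
    dz = pose_a[2] - pose_b[2]
    return math.hypot(dx, dy) <= xy_tol and z_gap[0] <= dz <= z_gap[1]

# A TAMP sampler could reject placements violating the VLM-inferred relation:
cup_pose = (0.51, 0.20, 0.04)
saucer_pose = (0.50, 0.21, 0.01)
print(on_constraint(cup_pose, saucer_pose))  # → True: aligned, resting above
```

Checks like this one slot into the same satisfaction loop as the system's built-in constraints, which is what lets the planner treat open-world relations and geometric feasibility uniformly.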