🤖 AI Summary
This work addresses the challenge of enabling robots to perform multi-step manipulation tasks using only RGB images and natural language instructions, without relying on extensive robot-specific training data. The authors propose the first plug-and-play, open-vocabulary manipulation planning system that integrates pretrained vision-language models with a classical task-and-motion planning (TAMP) framework through a modular architecture. This approach enables zero-shot generation of multi-step manipulation plans, requires no robot demonstration data for training, and supports rapid cross-task transfer. The system is highly deployable and extensible, demonstrating competitive or superior performance compared to the VLA model π₀.₅-DROID (fine-tuned on 350 hours of human demonstrations) across 28 tabletop manipulation tasks in both simulation and real-world environments.
📝 Abstract
We present TiPToP, an extensible modular system that combines pretrained vision foundation models with an existing Task and Motion Planner (TAMP) to solve multi-step manipulation tasks directly from input RGB images and natural-language instructions. Our system aims to be simple and easy to use: it can be installed and run on a standard DROID setup in under one hour and adapted to new embodiments with minimal effort. We evaluate TiPToP -- which requires zero robot data -- on 28 tabletop manipulation tasks in simulation and the real world and find it matches or outperforms $π_{0.5}\text{-DROID}$, a vision-language-action (VLA) model fine-tuned on 350 hours of embodiment-specific demonstrations. TiPToP's modular architecture enables failure-mode analysis at the component level: we examine results from 173 evaluation trials and identify directions for improvement. We release TiPToP open-source to further research on modular manipulation systems and tighter integration between learning and planning. Project website and code: https://tiptop-robot.github.io