🤖 AI Summary
This work tackles planning for long-horizon, complex robotic manipulation tasks from partial-view point clouds: tasks requiring geometric reasoning, multi-object interaction, and reasoning about occluded objects. We propose a hierarchical planning framework that integrates large language models (LLMs) with a sampling-based optimizer over continuous parameters. Key contributions: (1) a novel relational dynamics model trained exclusively on single-step simulation data, enabling zero-shot generalization to arbitrary-length real-world tasks; (2) a unified relational representation that bridges point-cloud perception, natural-language instruction understanding, and physically feasible action generation; and (3) a geometry- and occlusion-aware point-cloud encoder coupled with multimodal prompt engineering. On real-world long-horizon tasks, our method achieves a success rate above 85%, significantly surpassing the best prior baseline (50%), and demonstrates strong generalization across challenging scenarios involving multi-object interaction, geometric reasoning, and occlusion-aware manipulation.
📝 Abstract
We present Points2Plans, a framework for composable planning with a relational dynamics model that enables robots to solve long-horizon manipulation tasks from partial-view point clouds. Given a language instruction and a point cloud of the scene, our framework initiates a hierarchical planning procedure, whereby a language model generates a high-level plan and a sampling-based planner produces constraint-satisfying continuous parameters for manipulation primitives sequenced according to the high-level plan. Key to our approach is the use of a relational dynamics model as a unifying interface between the continuous and symbolic representations of states and actions, thus facilitating language-driven planning from high-dimensional perceptual input such as point clouds. Whereas previous relational dynamics models require training on datasets of multi-step manipulation scenarios that align with the intended test scenarios, Points2Plans uses only single-step simulated training data while generalizing zero-shot to a variable number of steps during real-world evaluations. We evaluate our approach on tasks involving geometric reasoning, multi-object interactions, and occluded object reasoning in both simulated and real-world settings. Results demonstrate that Points2Plans offers strong generalization to unseen long-horizon tasks in the real world, where it solves over 85% of evaluated tasks while the next best baseline solves only 50%.
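The hierarchical procedure described above can be sketched as a loop: a language model proposes a symbolic primitive sequence, and for each primitive a sampling-based planner searches for continuous parameters that the relational dynamics model predicts will satisfy the task constraints. The sketch below is a minimal illustration under assumed interfaces; all function names (`llm_high_level_plan`, `dynamics_rollout`, `satisfies_constraints`) are hypothetical stand-ins, not the paper's actual API.

```python
import random

def llm_high_level_plan(instruction):
    # Hypothetical stand-in: a real system would query an LLM with the
    # instruction and a symbolic description of the perceived scene.
    return [("pick", "mug"), ("place", "mug", "shelf")]

def dynamics_rollout(state, primitive, params):
    # Hypothetical stand-in for the learned relational dynamics model:
    # predicts the next (relational) state after executing one primitive.
    return {**state, "last": (primitive, tuple(params))}

def satisfies_constraints(state):
    # Hypothetical feasibility check, e.g. no collisions and the
    # desired object relations hold in the predicted state.
    return True

def plan(instruction, state, samples_per_step=32):
    """Hierarchical planning sketch: symbolic plan from the LLM, then
    sampling-based search over continuous primitive parameters,
    validated step by step with the dynamics model."""
    full_plan = []
    for step in llm_high_level_plan(instruction):
        for _ in range(samples_per_step):
            # Sample candidate continuous parameters (e.g. grasp pose).
            params = [random.uniform(-1.0, 1.0) for _ in range(3)]
            predicted = dynamics_rollout(state, step, params)
            if satisfies_constraints(predicted):
                full_plan.append((step, params))
                state = predicted
                break
        else:
            return None  # no feasible parameters found for this step
    return full_plan
```

Because the dynamics model is only ever queried one step at a time, chaining its single-step predictions in this loop is what lets a model trained on single-step data be composed into plans of arbitrary length.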