🤖 AI Summary
To address weak cross-task and cross-object generalization and the heavy reliance on large-scale training data in general-purpose robotic manipulation, this paper proposes a zero-shot manipulation framework grounded in geometric constraints as a universal interface. Methodologically, it models task-implicit object-part relationships (e.g., "knife edge perpendicular to carrot axis") as interpretable symbolic constraints, integrating large foundation models for constraint generation, a symbolic geometric parser for semantic interpretation, and an optimization-driven trajectory solver for action execution, thereby bypassing end-to-end vision-language-action joint training. Key contributions include: (i) the first geometric-constraint-driven zero-shot mapping mechanism, enabling direct natural-language-to-action translation; and (ii) context-aware adaptation, failure-feedback learning, long-horizon planning, and efficient imitation-data collection. Evaluated in both simulation and real-world settings, the framework achieves state-of-the-art performance, substantially improves out-of-distribution generalization, and eliminates costly retraining.
📝 Abstract
We present GeoManip, a framework that enables generalist robots to leverage essential conditions derived from object and part relationships, expressed as geometric constraints, for robot manipulation. For example, cutting a carrot requires adhering to a geometric constraint: the blade of the knife should be perpendicular to the carrot's main axis. By interpreting these constraints through symbolic language representations and translating them into low-level actions, GeoManip bridges the gap between natural language and robotic execution, enabling greater generalizability across diverse, even unseen, tasks, objects, and scenarios. Unlike vision-language-action models that require extensive training, GeoManip operates training-free by utilizing large foundation models: a constraint generation module predicts stage-specific geometric constraints, and a geometry parser identifies the object parts involved in these constraints. A solver then optimizes trajectories to satisfy the constraints inferred from the task description and the scene. Furthermore, GeoManip learns in-context and provides five appealing human-robot interaction features: on-the-fly policy adaptation, learning from human demonstrations, learning from failure cases, long-horizon action planning, and efficient data collection for imitation learning. Extensive evaluations in both simulation and real-world scenarios demonstrate GeoManip's state-of-the-art performance and superior out-of-distribution generalization while avoiding costly model training.
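To make the "geometric constraint as a cost for a solver" idea concrete, here is a minimal, hypothetical sketch (not the paper's actual implementation): the "blade perpendicular to carrot axis" constraint is encoded as a scalar cost and minimized over a single orientation angle in 2D. All function names, the 2D simplification, and the gradient-descent solver are illustrative assumptions.

```python
import math

def unit(vx, vy):
    # normalize a 2D vector
    n = math.hypot(vx, vy)
    return vx / n, vy / n

def perpendicularity_cost(theta, carrot_axis):
    # blade direction obtained by rotating the x-axis by theta;
    # the squared dot product is zero iff blade ⟂ carrot axis
    bx, by = math.cos(theta), math.sin(theta)
    ax, ay = unit(*carrot_axis)
    dot = bx * ax + by * ay
    return dot * dot

def solve_angle(carrot_axis, lr=0.5, steps=200):
    # toy solver: gradient descent with a central-difference gradient
    theta, eps = 0.0, 1e-5
    for _ in range(steps):
        g = (perpendicularity_cost(theta + eps, carrot_axis)
             - perpendicularity_cost(theta - eps, carrot_axis)) / (2 * eps)
        theta -= lr * g
    return theta

axis = (1.0, 0.3)          # carrot pointing roughly along x
theta = solve_angle(axis)  # blade angle satisfying the constraint
print(perpendicularity_cost(theta, axis))  # near 0
```

A real system would optimize full 6-DoF end-effector trajectories over multiple stage-specific constraints with a numerical optimizer, but the structure is the same: symbolic constraints become differentiable costs, and a solver drives them toward zero.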