🤖 AI Summary
In open-world robotic manipulation, cross-object skill transfer must reconcile high-level structural differences between objects with consistent low-level interaction control. To address this tension, we propose a semantics-aligned zero-shot skill generalization framework that decomposes skills into a prioritized set of task-axis controllers anchored on object keypoints and geometric axes. We introduce the Grounded Task-Axis (GTA) controller paradigm, which enables structured skill decomposition and semantic-level zero-shot transfer, and we use a vision foundation model (SD-DINO) to drive cross-object keypoint alignment and controller redeployment. By integrating keypoint-axis geometric grounding with example-guided transfer, our method executes diverse real-robot tasks, including screw tightening, liquid pouring, and scraping, on previously unseen objects. Experiments demonstrate substantial improvements in robustness to novel objects and generalization across manipulation tasks.
📝 Abstract
Transferring skills between different objects remains one of the core challenges of open-world robot manipulation. Generalization must account for the high-level structural differences between distinct objects while still maintaining similar low-level interaction control. In this paper, we propose an example-based zero-shot approach to skill transfer. Rather than treating skills as atomic, we decompose skills into a prioritized list of grounded task-axis (GTA) controllers. Each GTA controller (GTAC) defines an adaptable controller, such as a position or force controller, along an axis. Importantly, the GTACs are grounded in object keypoints and axes, e.g., the relative position of a screw head or the axis of its shaft. Zero-shot transfer is thus achieved by finding semantically similar grounding features on novel target objects. We achieve this example-based grounding of the skills through the use of foundation models, such as SD-DINO, that can detect semantically similar keypoints of objects. We evaluate our framework on real-robot experiments, including screwing, pouring, and spatula scraping tasks, and demonstrate robust and versatile controller transfer for each.
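The decomposition described above can be sketched in code. The following is a minimal illustrative sketch, not the paper's implementation: the names `GTAController` and `transfer_skill` are hypothetical, and the semantic keypoint matching (done with SD-DINO in the paper) is assumed rather than implemented, with identical keypoint names standing in for detected correspondences.

```python
from dataclasses import dataclass, replace

# Hypothetical sketch of a prioritized grounded task-axis (GTA) controller
# decomposition; names and fields are illustrative, not from the paper's code.

@dataclass(frozen=True)
class GTAController:
    """One grounded task-axis controller in a prioritized skill decomposition."""
    priority: int   # lower value = higher priority in the controller stack
    mode: str       # e.g. "position" or "force"
    keypoint: str   # name of the semantic keypoint the controller is grounded on
    axis: tuple     # task-axis direction in the object frame
    anchor: tuple = (0.0, 0.0, 0.0)  # grounded 3D position of the keypoint

def transfer_skill(skill, target_keypoints):
    """Re-ground each controller on the matched keypoint of a novel object.

    `target_keypoints` maps keypoint names to 3D positions on the new object;
    in the paper this correspondence would come from a semantic matcher such
    as SD-DINO, which is assumed (not implemented) here.
    """
    transferred = []
    for c in sorted(skill, key=lambda c: c.priority):
        if c.keypoint not in target_keypoints:
            raise KeyError(f"no semantic match for keypoint '{c.keypoint}'")
        transferred.append(replace(c, anchor=target_keypoints[c.keypoint]))
    return transferred

# Example: a two-controller screwing skill transferred to an unseen screw.
skill = [
    GTAController(1, "position", "shaft_axis", (0.0, 0.0, 1.0)),
    GTAController(0, "force", "screw_head", (0.0, 0.0, -1.0)),
]
new_skill = transfer_skill(skill, {"screw_head": (0.1, 0.2, 0.3),
                                   "shaft_axis": (0.1, 0.2, 0.4)})
```

The sketch keeps the skill itself object-agnostic: only the `anchor` positions change between source and target objects, while the prioritized controller structure is reused as-is.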