Grounded Task Axes: Zero-Shot Semantic Skill Generalization via Task-Axis Controllers and Visual Foundation Models

📅 2025-05-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
In open-world robotic manipulation, cross-object skill transfer suffers from a fundamental tension between high-level structural disparities and low-level control consistency. To address this, we propose a semantics-aligned zero-shot skill generalization framework that decomposes skills into a prioritized set of task-axis controllers anchored on object keypoints and geometric axes. We introduce the Grounded Task-Axis (GTA) controller paradigm, enabling structured skill decomposition and semantic-level zero-shot transfer. Furthermore, we pioneer the use of a vision foundation model (SD-DINO) to drive cross-object semantic keypoint alignment and controller redeployment. By integrating keypoint-axis geometric grounding with example-guided transfer, our method successfully executes diverse real-robot tasks, including screw tightening, liquid pouring, and scraping, on previously unseen objects. Experiments demonstrate substantial improvements in robustness to novel objects and generalization across manipulation tasks.

📝 Abstract
Transferring skills between different objects remains one of the core challenges of open-world robot manipulation. Generalization needs to take into account the high-level structural differences between distinct objects while still maintaining similar low-level interaction control. In this paper, we propose an example-based zero-shot approach to skill transfer. Rather than treating skills as atomic, we decompose skills into a prioritized list of grounded task-axis (GTA) controllers. Each GTAC defines an adaptable controller, such as a position or force controller, along an axis. Importantly, the GTACs are grounded in object key points and axes, e.g., the relative position of a screw head or the axis of its shaft. Zero-shot transfer is thus achieved by finding semantically-similar grounding features on novel target objects. We achieve this example-based grounding of the skills through the use of foundation models, such as SD-DINO, that can detect semantically similar keypoints of objects. We evaluate our framework on real-robot experiments, including screwing, pouring, and spatula scraping tasks, and demonstrate robust and versatile controller transfer for each.
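The abstract's core idea, decomposing a skill into a prioritized list of grounded task-axis controllers, can be sketched as a small data structure. The names, fields, and the simple proportional control law below are illustrative assumptions, not the paper's implementation:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class GTAController:
    """One grounded task-axis controller: regulates position or force
    along a single axis anchored to an object keypoint (illustrative sketch)."""
    kind: str              # assumed values: "position" or "force"
    keypoint: np.ndarray   # 3D anchor on the object, e.g. a screw head
    axis: np.ndarray       # unit task axis, e.g. the screw shaft direction
    target: float          # desired displacement or force along the axis
    priority: int          # lower number = higher priority in the skill

def command_along_axis(ctrl: GTAController, current: np.ndarray) -> np.ndarray:
    """Project the error onto the task axis (toy proportional control)."""
    along = float(np.dot(current - ctrl.keypoint, ctrl.axis))
    return (ctrl.target - along) * ctrl.axis  # correction restricted to the axis

# A skill is then a prioritized list of such controllers.
screw_skill = sorted(
    [GTAController("force", np.zeros(3), np.array([0.0, 0.0, 1.0]), -2.0, 0),
     GTAController("position", np.zeros(3), np.array([1.0, 0.0, 0.0]), 0.0, 1)],
    key=lambda c: c.priority,
)
```

Zero-shot transfer would then amount to re-anchoring each controller's `keypoint` and `axis` on the novel object while keeping the control law unchanged.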
Problem

Research questions and friction points this paper is trying to address.

Transferring robot skills between different objects
Decomposing skills into adaptable task-axis controllers
Using foundation models for zero-shot semantic skill transfer
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decompose skills into grounded task-axis controllers
Use foundation models for semantic keypoint detection
Achieve zero-shot transfer via semantically-similar features
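The keypoint-transfer step above reduces to nearest-neighbor matching in a dense semantic feature space; in the paper that space comes from SD-DINO. The sketch below abstracts the feature extractor away and shows only the cosine-similarity matching; array shapes and function names are assumptions:

```python
import numpy as np

def match_keypoints(src_feats, tgt_feats, src_keypoints):
    """Transfer (row, col) keypoints from a source object to a target object
    by nearest-neighbor cosine similarity over dense per-pixel features.
    src_feats, tgt_feats: (H, W, D) feature maps (e.g. from SD-DINO)."""
    H, W, D = tgt_feats.shape
    tgt_flat = tgt_feats.reshape(-1, D)
    tgt_flat = tgt_flat / np.linalg.norm(tgt_flat, axis=1, keepdims=True)
    matches = []
    for (r, c) in src_keypoints:
        f = src_feats[r, c]
        f = f / np.linalg.norm(f)
        sims = tgt_flat @ f                    # cosine similarity to every target cell
        best = int(np.argmax(sims))
        matches.append((best // W, best % W))  # flat index back to (row, col)
    return matches
```

Once semantically similar keypoints are found on the novel object, the grounded controllers can be re-anchored there without retraining.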