Generalizable Coarse-to-Fine Robot Manipulation via Language-Aligned 3D Keypoints

📅 2025-09-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the limited generalization capability of hierarchical policies in robotic 3D manipulation—particularly their difficulty adapting to novel instructions and environmental variations—this paper proposes a language-aligned coarse-to-fine manipulation framework. Guided by natural language, the framework decomposes tasks into semantic levels and introduces language-aligned 3D keypoints as interpretable, transferable intermediate representations. By jointly fine-tuning vision-language models (VLMs) and incorporating 3D-perception-aware encoding, it enables cross-scene policy transfer. Its key innovation lies in formalizing language–3D keypoint alignment as a unified interface for strategy generalization—the first such formulation in the literature. On the GemBench benchmark, the method achieves a 12% higher average success rate than the state-of-the-art (SOTA) using only one-fifth of the training data. In real-world settings, it generalizes to unseen tasks with as few as ten demonstrations.

Technology Category

Application Category

📝 Abstract
Hierarchical coarse-to-fine policy, where a coarse branch predicts a region of interest to guide a fine-grained action predictor, has demonstrated significant potential in robotic 3D manipulation tasks by especially enhancing sample efficiency and enabling more precise manipulation. However, even augmented with pre-trained models, these hierarchical policies still suffer from generalization issues. To enhance generalization to novel instructions and environment variations, we propose Coarse-to-fine Language-Aligned manipulation Policy (CLAP), a framework that integrates three key components: 1) task decomposition, 2) VLM fine-tuning for 3D keypoint prediction, and 3) 3D-aware representation. Through comprehensive experiments in simulation and on a real robot, we demonstrate its superior generalization capability. Specifically, on GemBench, a benchmark designed for evaluating generalization, our approach achieves a 12% higher average success rate than the SOTA method while using only 1/5 of the training trajectories. In real-world experiments, our policy, trained on only 10 demonstrations, successfully generalizes to novel instructions and environments.
Problem

Research questions and friction points this paper is trying to address.

Improving generalization in hierarchical robot manipulation policies
Enhancing adaptation to novel instructions and environment variations
Increasing sample efficiency with coarse-to-fine 3D keypoint prediction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical coarse-to-fine policy for manipulation
VLM fine-tuned for 3D keypoint prediction
3D-aware representation for generalization enhancement
🔎 Similar Papers
No similar papers found.
Jianshu Hu
Jianshu Hu
Shanghai Jiao Tong University
Reinforcement LearningRobotics
L
Lidi Wang
Global College, Shanghai Jiao Tong University
S
Shujia Li
Global College, Shanghai Jiao Tong University
Y
Yunpeng Jiang
Global College, Shanghai Jiao Tong University
Y
Yutong Ban
Global College, Shanghai Jiao Tong University
X
Xiao Li
School of Mechanical Engineering, Shanghai Jiao Tong University
Paul Weng
Paul Weng
Duke Kunshan University
Artificial IntelligenceReinforcement Learning/Markov Decision ProcessQualitative/Ordinal Models