🤖 AI Summary
To address the limited generalization capability of hierarchical policies in robotic 3D manipulation—particularly their difficulty adapting to novel instructions and environmental variations—this paper proposes a language-aligned coarse-to-fine manipulation framework. Guided by natural language, the framework decomposes tasks into semantic levels and introduces language-aligned 3D keypoints as interpretable, transferable intermediate representations. By jointly fine-tuning vision-language models (VLMs) and incorporating 3D-perception-aware encoding, it enables cross-scene policy transfer. Its key innovation lies in formalizing language–3D keypoint alignment as a unified interface for strategy generalization—the first such formulation in the literature. On the GemBench benchmark, the method achieves a 12% higher average success rate than the state-of-the-art (SOTA) using only one-fifth of the training data. In real-world settings, it generalizes to unseen tasks with as few as ten demonstrations.
📝 Abstract
Hierarchical coarse-to-fine policy, where a coarse branch predicts a region of interest to guide a fine-grained action predictor, has demonstrated significant potential in robotic 3D manipulation tasks by especially enhancing sample efficiency and enabling more precise manipulation. However, even augmented with pre-trained models, these hierarchical policies still suffer from generalization issues. To enhance generalization to novel instructions and environment variations, we propose Coarse-to-fine Language-Aligned manipulation Policy (CLAP), a framework that integrates three key components: 1) task decomposition, 2) VLM fine-tuning for 3D keypoint prediction, and 3) 3D-aware representation. Through comprehensive experiments in simulation and on a real robot, we demonstrate its superior generalization capability. Specifically, on GemBench, a benchmark designed for evaluating generalization, our approach achieves a 12% higher average success rate than the SOTA method while using only 1/5 of the training trajectories. In real-world experiments, our policy, trained on only 10 demonstrations, successfully generalizes to novel instructions and environments.