Generalizable Coarse-to-Fine Robot Manipulation via Language-Aligned 3D Keypoints

📅 2025-09-27

📈 Citations: 0

✨ Influential: 0

career value

196K/year

🤖 AI Summary

To address the limited generalization capability of hierarchical policies in robotic 3D manipulation—particularly their difficulty adapting to novel instructions and environmental variations—this paper proposes a language-aligned coarse-to-fine manipulation framework. Guided by natural language, the framework decomposes tasks into semantic levels and introduces language-aligned 3D keypoints as interpretable, transferable intermediate representations. By jointly fine-tuning vision-language models (VLMs) and incorporating 3D-perception-aware encoding, it enables cross-scene policy transfer. Its key innovation lies in formalizing language–3D keypoint alignment as a unified interface for strategy generalization—the first such formulation in the literature. On the GemBench benchmark, the method achieves a 12% higher average success rate than the state-of-the-art (SOTA) using only one-fifth of the training data. In real-world settings, it generalizes to unseen tasks with as few as ten demonstrations.

Technology Category

Application Category

📝 Abstract

Hierarchical coarse-to-fine policy, where a coarse branch predicts a region of interest to guide a fine-grained action predictor, has demonstrated significant potential in robotic 3D manipulation tasks by especially enhancing sample efficiency and enabling more precise manipulation. However, even augmented with pre-trained models, these hierarchical policies still suffer from generalization issues. To enhance generalization to novel instructions and environment variations, we propose Coarse-to-fine Language-Aligned manipulation Policy (CLAP), a framework that integrates three key components: 1) task decomposition, 2) VLM fine-tuning for 3D keypoint prediction, and 3) 3D-aware representation. Through comprehensive experiments in simulation and on a real robot, we demonstrate its superior generalization capability. Specifically, on GemBench, a benchmark designed for evaluating generalization, our approach achieves a 12% higher average success rate than the SOTA method while using only 1/5 of the training trajectories. In real-world experiments, our policy, trained on only 10 demonstrations, successfully generalizes to novel instructions and environments.

Problem

Research questions and friction points this paper is trying to address.

Improving generalization in hierarchical robot manipulation policies

Enhancing adaptation to novel instructions and environment variations

Increasing sample efficiency with coarse-to-fine 3D keypoint prediction

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical coarse-to-fine policy for manipulation

VLM fine-tuned for 3D keypoint prediction

3D-aware representation for generalization enhancement

🔎 Similar Papers

What Foundation Models can Bring for Robot Learning in Manipulation : A Survey