Scaffolding Dexterous Manipulation with Vision-Language Models

📅 2025-06-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a bottleneck in dexterous manipulation: the reliance on hand-crafted reward functions or expert demonstration trajectories. It proposes a framework, grounded in vision-language models (VLMs), that needs neither demonstrations nor task-specific reward engineering. A frozen, off-the-shelf VLM parses the natural-language task instruction together with a scene image, identifies task-relevant keypoints, and generates coarse 3D trajectories for the hand and the object that guide high-level exploration. A low-level residual reinforcement learning policy, trained in simulation, then refines these coarse trajectories into high-fidelity execution. The framework requires no reference trajectories, reward engineering, or VLM fine-tuning, and it learns robust policies across simulated tasks involving articulated objects and semantic understanding, with transfer to real robotic hands. Its core contribution is using the VLM's cross-modal understanding to supply motion priors for dexterous manipulation, which lowers the barrier to learning complex manipulation policies.

📝 Abstract

Dexterous robotic hands are essential for performing complex manipulation tasks, yet remain difficult to train due to the challenges of demonstration collection and high-dimensional control. While reinforcement learning (RL) can alleviate the data bottleneck by generating experience in simulation, it typically relies on carefully designed, task-specific reward functions, which hinder scalability and generalization. Thus, contemporary works in dexterous manipulation have often bootstrapped from reference trajectories. These trajectories specify target hand poses that guide the exploration of RL policies and object poses that enable dense, task-agnostic rewards. However, sourcing suitable trajectories - particularly for dexterous hands - remains a significant challenge. Yet, the precise details in explicit reference trajectories are often unnecessary, as RL ultimately refines the motion. Our key insight is that modern vision-language models (VLMs) already encode the commonsense spatial and semantic knowledge needed to specify tasks and guide exploration effectively. Given a task description (e.g., "open the cabinet") and a visual scene, our method uses an off-the-shelf VLM to first identify task-relevant keypoints (e.g., handles, buttons) and then synthesize 3D trajectories for hand motion and object motion. Subsequently, we train a low-level residual RL policy in simulation to track these coarse trajectories or "scaffolds" with high fidelity. Across a number of simulated tasks involving articulated objects and semantic understanding, we demonstrate that our method is able to learn robust dexterous manipulation policies. Moreover, we showcase that our method transfers to real-world robotic hands without any human demonstrations or handcrafted rewards.
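The abstract describes turning VLM-identified keypoints into coarse 3D trajectories. As a rough illustration only, the sketch below linearly interpolates between a few 3D waypoints to produce a dense path. The function name `interpolate_scaffold`, the `steps_per_segment` parameter, and the use of linear interpolation are assumptions made for this example; the summary does not say how the paper densifies its trajectories.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def interpolate_scaffold(waypoints: List[Point], steps_per_segment: int) -> List[Point]:
    """Densify sparse 3D waypoints into a coarse path by linear interpolation.

    Hypothetical helper: the waypoints stand in for positions a VLM might
    propose relative to detected keypoints (e.g., a cabinet handle).
    """
    if len(waypoints) < 2:
        raise ValueError("need at least two waypoints")
    if steps_per_segment < 1:
        raise ValueError("steps_per_segment must be >= 1")

    path: List[Point] = [waypoints[0]]
    # Walk each consecutive pair of waypoints and emit evenly spaced points,
    # ending exactly on the segment's end waypoint.
    for start, end in zip(waypoints, waypoints[1:]):
        for k in range(1, steps_per_segment + 1):
            t = k / steps_per_segment
            path.append(tuple(s + t * (e - s) for s, e in zip(start, end)))
    return path


# Example: approach a handle at x = 1, then pull along z.
scaffold = interpolate_scaffold([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 0.0, 2.0)], 2)
```

A path like this is deliberately imprecise: per the abstract, the RL policy is what refines the motion, so the scaffold only needs to point exploration in a sensible direction.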

Problem

Research questions and friction points this paper is trying to address.

Training dexterous robotic hands is challenging due to high-dimensional control.

Reinforcement learning needs task-specific rewards, limiting scalability and generalization.

Sourcing suitable reference trajectories for dexterous manipulation remains difficult.

Innovation

Methods, ideas, or system contributions that make the work stand out.

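Uses an off-the-shelf vision-language model to specify the task and guide exploration.

Synthesizes coarse 3D hand and object trajectories from VLM-identified keypoints.

Trains a residual RL policy in simulation to track these coarse trajectory scaffolds.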
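To make the residual-tracking idea concrete, here is a minimal sketch of two pieces such a setup commonly uses: composing a base command with a small learned correction, and a dense reward that decays with distance from the target pose. The function names, the `scale` and `sharpness` parameters, and the exponential reward shape are all assumptions for illustration, not details taken from the paper.

```python
import math
from typing import List, Sequence


def residual_action(
    base: Sequence[float],
    residual: Sequence[float],
    scale: float = 0.1,
    low: float = -1.0,
    high: float = 1.0,
) -> List[float]:
    """Add a scaled policy correction to the base command, then clamp.

    `base` stands in for the command that follows the coarse trajectory;
    `residual` stands in for the RL policy output (assumed to lie in [-1, 1]).
    """
    return [min(high, max(low, b + scale * r)) for b, r in zip(base, residual)]


def tracking_reward(
    actual: Sequence[float], target: Sequence[float], sharpness: float = 10.0
) -> float:
    """Dense reward in (0, 1] that decays with Euclidean distance to the target."""
    return math.exp(-sharpness * math.dist(actual, target))


# Example: the second component would leave the action range and is clamped.
action = residual_action([0.5, -0.95], [1.0, -1.0])
on_target = tracking_reward((1.0, 0.0, 2.0), (1.0, 0.0, 2.0))
off_target = tracking_reward((1.0, 0.0, 1.9), (1.0, 0.0, 2.0))
```

Keeping `scale` small bounds how far the learned policy can move away from the scaffold, and the tracking reward depends only on the target pose rather than on task-specific terms.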

🔎 Similar Papers

Learning Manipulation Skills through Robot Chain-of-Thought with Sparse Failure Guidance

2024-05-22 · arXiv.org · Citations: 1

What Foundation Models can Bring for Robot Learning in Manipulation : A Survey

2024-04-28 · arXiv.org · Citations: 15
