From Human Hands to Robot Arms: Manipulation Skills Transfer via Trajectory Alignment

📅 2025-10-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the cross-modal alignment challenge in transferring manipulation skills from human videos to robots, arising from morphological discrepancies. We propose Traj2Action: a framework that employs 3D end-effector trajectories as a unified intermediate representation, enabling motion disentanglement via 3D trajectory reconstruction and cross-modal alignment. Crucially, we introduce a co-denoising mechanism within a diffusion model to jointly optimize robot joint poses and gripper actions—a first in vision-to-robot policy learning. The framework supports end-to-end policy learning from large-scale, unpaired human video data. Experiments on the Franka platform demonstrate substantial improvements over the π₀ baseline: success rates increase by 27.0% for short-horizon and 22.25% for long-horizon manipulation tasks. Moreover, performance scales consistently with the volume of human video data. This work establishes a novel paradigm for morphology-agnostic skill transfer without requiring paired human–robot demonstrations.

📝 Abstract
Learning diverse manipulation skills for real-world robots is severely bottlenecked by the reliance on costly and hard-to-scale teleoperated demonstrations. While human videos offer a scalable alternative, effectively transferring manipulation knowledge is fundamentally hindered by the significant morphological gap between human and robotic embodiments. To address this challenge and facilitate skill transfer from human to robot, we introduce Traj2Action, a novel framework that bridges this embodiment gap by using the 3D trajectory of the operational endpoint as a unified intermediate representation, and then transfers the manipulation knowledge embedded in this trajectory to the robot's actions. Our policy first learns to generate a coarse trajectory, which forms a high-level motion plan, by leveraging both human and robot data. This plan then conditions the synthesis of precise, robot-specific actions (e.g., orientation and gripper state) within a co-denoising framework. Extensive real-world experiments on a Franka robot demonstrate that Traj2Action boosts performance by up to 27% and 22.25% over the $π_0$ baseline on short- and long-horizon real-world tasks, respectively, and achieves significant gains as the amount of human data used in robot policy learning scales. Our project website, featuring code and video demonstrations, is available at https://anonymous.4open.science/w/Traj2Action-4A45/.
Problem

Research questions and friction points this paper is trying to address.

Transferring manipulation skills from humans to robots
Bridging the morphological gap between human and robot embodiments
Generating robot actions from human trajectory data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses 3D trajectory as unified intermediate representation
Generates coarse trajectory plan from human data
Synthesizes robot-specific actions via co-denoising framework
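The co-denoising idea above can be sketched as two coupled diffusion branches: one refines the 3D endpoint trajectory while the other refines robot-specific actions conditioned on the current trajectory estimate. The sketch below is purely illustrative; the simple shrinkage "denoisers" and all names are stand-ins, not the paper's actual networks or API.

```python
import numpy as np

rng = np.random.default_rng(0)

def co_denoise_step(traj, action, alpha=0.9):
    """One illustrative co-denoising step (hypothetical stand-in networks).

    The trajectory branch refines the coarse 3D end-effector plan; the
    action branch predicts robot-specific components (orientation, gripper)
    conditioned on the current trajectory estimate.
    """
    # Stand-in "denoisers": shrink each modality toward its conditioned mean.
    traj_next = alpha * traj
    action_next = alpha * (action + 0.1 * traj.mean())  # conditioned on traj
    return traj_next, action_next

# Start both modalities from pure noise and denoise them jointly.
traj = rng.standard_normal((16, 3))    # 16 waypoints of a 3D endpoint trajectory
action = rng.standard_normal((16, 5))  # e.g. orientation (4 dims) + gripper (1 dim)
for _ in range(10):
    traj, action = co_denoise_step(traj, action)
```

The key design point mirrored here is the asymmetry: the trajectory branch depends only on itself (so it can be trained on human video alone), while the action branch consumes the trajectory as conditioning, which is where robot-specific supervision enters.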
Han Zhou — MAPLE Lab, Westlake University
Jinjin Cao — MAPLE Lab, Westlake University
Liyuan Ma — Zhejiang University (image synthesis, generative models, GANs, diffusion models)
Xueji Fang — Zhejiang University (diffusion models, multimodal language models, computer vision)
Guo-jun Qi — MAPLE Lab, Westlake University