🤖 AI Summary
This work addresses the cross-modal alignment challenge in transferring manipulation skills from human videos to robots, arising from morphological discrepancies. We propose Traj2Action: a framework that employs 3D end-effector trajectories as a unified intermediate representation, enabling motion disentanglement via 3D trajectory reconstruction and cross-modal alignment. Crucially, we introduce a co-denoising mechanism within a diffusion model to jointly optimize robot joint poses and gripper actions—a first in vision-to-robot policy learning. The framework supports end-to-end policy learning from large-scale, unpaired human video data. Experiments on the Franka platform demonstrate substantial improvements over the π₀ baseline: success rates increase by 27.0% for short-horizon and 22.25% for long-horizon manipulation tasks. Moreover, performance scales consistently with the volume of human video data. This work establishes a novel paradigm for morphology-agnostic skill transfer without requiring paired human–robot demonstrations.
📝 Abstract
Learning diverse manipulation skills for real-world robots is severely bottlenecked by the reliance on costly and hard-to-scale teleoperated demonstrations. While human videos offer a scalable alternative, effectively transferring manipulation knowledge is fundamentally hindered by the significant morphological gap between human and robotic embodiments. To address this challenge and facilitate skill transfer from human to robot, we introduce Traj2Action, a novel framework that bridges this embodiment gap by using the 3D trajectory of the operational endpoint as a unified intermediate representation, and then transfers the manipulation knowledge embedded in this trajectory to the robot's actions. Our policy first learns to generate a coarse trajectory, which forms a high-level motion plan, by leveraging both human and robot data. This plan then conditions the synthesis of precise, robot-specific actions (e.g., orientation and gripper state) within a co-denoising framework. Extensive real-world experiments on a Franka robot demonstrate that Traj2Action boosts performance by up to 27% and 22.25% over the $π_0$ baseline on short- and long-horizon real-world tasks, respectively, and achieves significant gains as the amount of human video data scales. Our project website, featuring code and video demonstrations, is available at https://anonymous.4open.science/w/Traj2Action-4A45/.
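To make the two-stage idea concrete (a coarse trajectory plan conditioning the synthesis of robot-specific actions during joint denoising), here is a minimal toy sketch in plain NumPy. The function name `co_denoise` and the linear "denoiser" updates are hypothetical placeholders for the paper's learned diffusion networks, and the conditioning is reduced to a simple summary statistic; this illustrates the control flow only, not the actual method.

```python
import numpy as np

def co_denoise(traj_noisy, act_noisy, steps=10):
    """Toy co-denoising loop: at every reverse step, the current
    estimate of the 3D end-effector trajectory conditions the
    update of the robot-specific action latents.

    The shrinkage updates below stand in for the learned
    denoising networks; they are NOT the paper's models.
    """
    traj, act = traj_noisy.copy(), act_noisy.copy()
    for _ in range(steps):
        # High-level motion plan: denoise the shared trajectory first.
        traj = 0.9 * traj  # placeholder for the trajectory denoiser
        # Low-level actions: the update is conditioned on the current
        # trajectory estimate (here, a crude scalar summary of it).
        cond = traj.mean()
        act = 0.9 * act + 0.1 * cond  # placeholder action denoiser
    return traj, act
```

In the actual framework both branches would be noise-prediction networks trained jointly, with the trajectory branch additionally supervised by trajectories extracted from human videos; the sketch only shows how one denoising pass can feed the other.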