Trajectory Conditioned Cross-embodiment Skill Transfer

📅 2025-10-09
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the cross-modal alignment challenge of transferring robotic manipulation skills from human demonstration videos, which arises from morphological discrepancies between humans and robots. We propose an end-to-end framework that requires neither paired data nor handcrafted reward functions. Our key innovation is the use of sparse optical flow trajectories as morphology-agnostic motion representations, decoupling the demonstrated motion from human kinematic constraints; combined with vision-language multimodal conditioning, this enables a direct mapping from monocular demonstration videos to robot joint-action sequences. Technically, the method integrates sparse optical flow extraction, cross-morphology video synthesis, and action translation modules. In MetaWorld simulations, the approach reduces Fréchet Video Distance (FVD) and Keypoint Velocity Distance (KVD) by 39.6% and 36.6%, respectively, while improving task success rate by 16.7%. Validation in a real-world kitchen environment demonstrates strong generalization to physical settings.
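The core idea of the summary, representing motion as sparse, normalized trajectories that abstract away embodiment, can be illustrated with a toy tracker. This is a minimal NumPy sketch, not the paper's method: a real system would extract sparse optical flow (e.g. Lucas-Kanade tracking), and all function names and shapes here are illustrative assumptions.

```python
import numpy as np

def make_frames(n_frames=8, size=64, step=2):
    """Synthetic video: a bright 6x6 blob translating diagonally."""
    frames = []
    for t in range(n_frames):
        f = np.zeros((size, size), dtype=np.float32)
        y, x = 10 + step * t, 10 + step * t
        f[y:y + 6, x:x + 6] = 1.0
        frames.append(f)
    return frames

def track_blob(frames):
    """Sparse 'trajectory': per-frame centroid of the bright region.
    A stand-in for real sparse optical flow tracking."""
    traj = []
    for f in frames:
        ys, xs = np.nonzero(f > 0.5)
        traj.append((xs.mean(), ys.mean()))
    return np.asarray(traj)  # shape (T, 2)

def normalize_trajectory(traj):
    """Translate to start at the origin and scale by total path length,
    yielding a motion cue independent of absolute position and scale."""
    traj = traj - traj[0]
    length = np.linalg.norm(np.diff(traj, axis=0), axis=1).sum()
    return traj / max(length, 1e-8)

frames = make_frames()
traj = normalize_trajectory(track_blob(frames))
```

The normalized trajectory starts at the origin and has unit path length, so the same cue could in principle condition a generator for a differently shaped manipulator.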

πŸ“ Abstract
Learning manipulation skills from human demonstration videos is a promising yet challenging problem, primarily due to the significant embodiment gap between the human body and robot manipulators. Existing methods rely on paired datasets or hand-crafted rewards, which limits scalability and generalization. We propose TrajSkill, a framework for Trajectory Conditioned Cross-embodiment Skill Transfer that enables robots to acquire manipulation skills directly from human demonstration videos. Our key insight is to represent human motions as sparse optical flow trajectories, which serve as embodiment-agnostic motion cues by removing morphological variations while preserving essential dynamics. Conditioned on these trajectories together with visual and textual inputs, TrajSkill jointly synthesizes temporally consistent robot manipulation videos and translates them into executable actions, thereby achieving cross-embodiment skill transfer. Experiments on simulation data (MetaWorld) show that TrajSkill reduces FVD by 39.6% and KVD by 36.6% compared with the state of the art, and improves cross-embodiment success rate by up to 16.7%. Real-robot experiments on kitchen manipulation tasks further validate the approach, demonstrating practical human-to-robot skill transfer across embodiments.
Problem

Research questions and friction points this paper is trying to address.

Learning robot skills from human videos despite embodiment differences
Overcoming limitations of paired datasets and hand-crafted rewards
Transferring human manipulation skills to robots across different embodiments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses optical flow trajectories as embodiment-agnostic motion cues
Synthesizes robot videos conditioned on visual and textual inputs
Translates synthesized videos into executable robot actions
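The three bullets above describe a trajectory-to-video-to-action pipeline. The following is a minimal interface sketch under stated assumptions; the function bodies are placeholders (the paper's actual generator and action translator are learned models), and all names, shapes, and the 7-dimensional action space are illustrative.

```python
import numpy as np

def synthesize_robot_video(trajectory, first_frame, instruction, n_frames=8):
    """Placeholder for trajectory-, image-, and text-conditioned video
    synthesis: here we simply tile the first frame. A real system would
    run a learned video generator conditioned on all three inputs."""
    video = np.repeat(first_frame[None], n_frames, axis=0)
    return video  # (T, H, W)

def translate_to_actions(video, action_dim=7):
    """Placeholder action translator: one joint-action vector per frame.
    A real system would use a learned video-to-action model."""
    n_frames = video.shape[0]
    return np.zeros((n_frames, action_dim))

traj = np.linspace([0.0, 0.0], [1.0, 1.0], 8)   # sparse motion cue, (T, 2)
frame0 = np.zeros((64, 64), dtype=np.float32)    # initial robot observation
video = synthesize_robot_video(traj, frame0, "open the drawer")
actions = translate_to_actions(video)
```

The point of the sketch is the data flow: an embodiment-agnostic trajectory plus the robot's own initial observation and a language instruction go in, and per-frame joint actions come out.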
YuHang Tang
Northwestern Polytechnical University, Shanghai AI Laboratory
Yixuan Lou
Northwestern Polytechnical University
Pengfei Han
Dalian University of Technology, previously with Tsinghua University
Haoming Song
Shanghai AI Laboratory, Shanghai Jiao Tong University
Xinyi Ye
Shanghai AI Laboratory
Dong Wang
Shanghai AI Laboratory
Bin Zhao
Northwestern Polytechnical University, Shanghai AI Laboratory