🤖 AI Summary
Transferring human video demonstrations to robotic manipulation policies faces challenges including absence of action annotations and significant morphological disparities between humans and robots. Method: This paper proposes a real-to-sim-to-real paradigm leveraging object motion trajectories as dense, cross-modal supervision signals. It first reconstructs photorealistic simulation environments from monocular RGB-D videos and tracks object trajectories to define reward functions for simulation-based reinforcement learning. Subsequently, the learned policy is distilled into an image-conditioned diffusion model and deployed end-to-end on real robots via online domain adaptation—without requiring real-robot demonstration data. Contribution/Results: The approach introduces the novel “object motion as supervision” principle, eliminating dependence on human action labels or robot teleoperation data. It enables zero-shot viewpoint generalization and real-time environmental alignment. Evaluated on five manipulation tasks, it achieves a 30% average improvement in task progress and matches the performance of behavior cloning using only 10% of its data collection time.
📝 Abstract
Human videos offer a scalable way to train robot manipulation policies, but lack the action labels needed by standard imitation learning algorithms. Existing cross-embodiment approaches try to map human motion to robot actions, but often fail when the embodiments differ significantly. We propose X-Sim, a real-to-sim-to-real framework that uses object motion as a dense and transferable signal for learning robot policies. X-Sim starts by reconstructing a photorealistic simulation from an RGB-D human video and tracking object trajectories to define object-centric rewards. These rewards are used to train a reinforcement learning (RL) policy in simulation. The learned policy is then distilled into an image-conditioned diffusion policy using synthetic rollouts rendered with varied viewpoints and lighting. To transfer to the real world, X-Sim introduces an online domain adaptation technique that aligns real and simulated observations during deployment. Importantly, X-Sim does not require any robot teleoperation data. We evaluate it across 5 manipulation tasks in 2 environments and show that it: (1) improves task progress by 30% on average over hand-tracking and sim-to-real baselines, (2) matches behavior cloning with 10x less data collection time, and (3) generalizes to new camera viewpoints and test-time changes. Code and videos are available at https://portal-cornell.github.io/X-Sim/.
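The object-centric reward idea above can be illustrated with a minimal sketch: reward the RL agent for keeping the simulated object close to where the tracked demonstration object was at the same timestep. The function name, the exponential shaping, and the `alpha` scale are illustrative assumptions, not X-Sim's exact formulation.

```python
import numpy as np

def object_centric_reward(current_pos, demo_traj, t, alpha=5.0):
    """Hypothetical dense reward from a tracked object trajectory.

    current_pos: (3,) object position in simulation at the current step.
    demo_traj:   list of (3,) object positions tracked from the human video.
    t:           current timestep, clamped to the demo length.
    alpha:       distance scale (assumed); higher values sharpen the reward.
    """
    target = demo_traj[min(t, len(demo_traj) - 1)]
    dist = np.linalg.norm(np.asarray(current_pos) - np.asarray(target))
    # Reward in (0, 1]: maximal when the object exactly tracks the demo.
    return float(np.exp(-alpha * dist))
```

Because the reward depends only on object state, it is defined identically for the human and the robot, which is what lets the signal transfer across embodiments.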