🤖 AI Summary
Transferring human video demonstrations to robotic manipulation policies faces challenges including absence of action annotations and significant morphological disparities between humans and robots. Method: This paper proposes a real-to-sim-to-real paradigm leveraging object motion trajectories as dense, cross-modal supervision signals. It first reconstructs photorealistic simulation environments from monocular RGB-D videos and tracks object trajectories to define reward functions for simulation-based reinforcement learning. Subsequently, the learned policy is distilled into an image-conditioned diffusion model and deployed end-to-end on real robots via online domain adaptation—without requiring real-robot demonstration data. Contribution/Results: The approach introduces the novel “object motion as supervision” principle, eliminating dependence on human action labels or robot teleoperation data. It enables zero-shot viewpoint generalization and real-time environmental alignment. Evaluated on five manipulation tasks, it achieves a 30% average improvement in task progress and matches the performance of behavior cloning using only 10% of its data collection time.
📝 Abstract
Human videos offer a scalable way to train robot manipulation policies, but lack the action labels needed by standard imitation learning algorithms. Existing cross-embodiment approaches try to map human motion to robot actions, but often fail when the embodiments differ significantly. We propose X-Sim, a real-to-sim-to-real framework that uses object motion as a dense and transferable signal for learning robot policies. X-Sim starts by reconstructing a photorealistic simulation from an RGB-D human video and tracking object trajectories to define object-centric rewards. These rewards are used to train a reinforcement learning (RL) policy in simulation. The learned policy is then distilled into an image-conditioned diffusion policy using synthetic rollouts rendered with varied viewpoints and lighting. To transfer to the real world, X-Sim introduces an online domain adaptation technique that aligns real and simulated observations during deployment. Importantly, X-Sim does not require any robot teleoperation data. We evaluate it across 5 manipulation tasks in 2 environments and show that it: (1) improves task progress by 30% on average over hand-tracking and sim-to-real baselines, (2) matches behavior cloning with 10x less data collection time, and (3) generalizes to new camera viewpoints and test-time changes. Code and videos are available at https://portal-cornell.github.io/X-Sim/.
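The object-centric reward idea above can be illustrated with a minimal sketch: reward the RL agent for keeping the simulated object close to where the tracked demonstration object was at the same timestep. The function name, the exponential shaping, and the `alpha` scale are illustrative assumptions, not X-Sim's exact formulation.

```python
import numpy as np

def object_centric_reward(current_pos, demo_traj, t, alpha=5.0):
    """Hypothetical dense reward from a tracked object trajectory.

    current_pos: (3,) object position in simulation at the current step.
    demo_traj:   list of (3,) object positions tracked from the human video.
    t:           current timestep, clamped to the demo length.
    alpha:       distance scale (assumed); higher values sharpen the reward.
    """
    target = demo_traj[min(t, len(demo_traj) - 1)]
    dist = np.linalg.norm(np.asarray(current_pos) - np.asarray(target))
    # Reward in (0, 1]: maximal when the object exactly tracks the demo.
    return float(np.exp(-alpha * dist))
```

Because the reward depends only on object state, it is defined identically for the human and the robot, which is what lets the signal transfer across embodiments.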