🤖 AI Summary
This work addresses the scarcity of multimodal training data for real-world robotic systems performing complex manipulation tasks. The authors propose a generative data engine based on extended reality (XR) that, for the first time, enables immersive physics-based simulation directly on XR headsets without requiring specialized hardware. By integrating human-to-robot motion retargeting with physics-guided, text-controllable video generation, the system constructs a high-quality multimodal synthetic dataset. Vision-based policies trained exclusively on this synthetic data transfer zero-shot to unseen, cluttered, and poorly lit real-world environments, performing dexterous manipulation tasks that involve deformable objects, loose granular materials, and rigid bodies.
📝 Abstract
We introduce Lucid-XR, a generative data engine for creating diverse and realistic-looking multi-modal data to train real-world robotic systems. At the core of Lucid-XR is vuer, a web-based physics simulation environment that runs directly on the XR headset, enabling internet-scale access to immersive, latency-free virtual interactions without requiring specialized equipment. The complete system integrates on-device physics simulation with human-to-robot pose retargeting. The collected data is further amplified by a physics-guided video generation pipeline that can be steered via natural language specifications. We demonstrate zero-shot transfer of robot visual policies to unseen, cluttered, and poorly lit evaluation environments after training entirely on Lucid-XR's synthetic data. We include examples across dexterous manipulation tasks that involve soft materials, loosely bound particles, and rigid body contact. Project website: https://lucidxr.github.io