🤖 AI Summary
This work addresses the high cost and poor generalizability of teleoperation-based data collection for robot learning. We propose a novel paradigm for automatically generating cross-morphology robot training data from monocular RGB(D) human hand–object interaction (HOI) videos. Our method integrates high-fidelity HOI reconstruction, physics-grounded reinforcement learning optimization, morphology-agnostic action representation, and cross-morphology trajectory retargeting, augmented by Isaac Sim simulation and domain randomization to form an end-to-end data generation pipeline. We provide the first empirical validation that hand–object interaction videos serve as high-quality supervision signals for robot learning. Crucially, our lightweight, universal action representation eliminates dependency on specific robot kinematics. Experiments demonstrate that the generated data achieves performance on par with teleoperated data across mainstream vision-language-action (VLA) and imitation learning models, while significantly improving cross-task and cross-morphology generalization. To foster community advancement, we open-source a large-scale, multimodal dataset.
📝 Abstract
We introduce Robowheel, a data engine that converts human hand–object interaction (HOI) videos into training-ready supervision for cross-morphology robotic learning. From monocular RGB or RGB-D inputs, we perform high-precision HOI reconstruction and enforce physical plausibility via a reinforcement learning (RL) optimizer that refines hand–object relative poses under contact and penetration constraints. The reconstructed, contact-rich trajectories are then retargeted across embodiments — robot arms with simple end effectors, dexterous hands, and humanoids — yielding executable actions and rollouts. To scale coverage, we build a simulation-augmented framework on Isaac Sim with diverse domain randomization (embodiments, trajectories, object retrieval, background textures, hand motion mirroring), which enriches the distributions of trajectories and observations while preserving spatial relationships and physical plausibility. Together, these stages form an end-to-end pipeline from video through reconstruction, retargeting, and augmentation to data acquisition. We validate the data on mainstream vision-language-action (VLA) and imitation learning architectures, demonstrating that trajectories produced by our pipeline are as stable as those from teleoperation and yield comparable continual performance gains. To our knowledge, this provides the first quantitative evidence that HOI modalities can serve as effective supervision for robotic learning. Compared with teleoperation, Robowheel is lightweight: a single monocular RGB(D) camera is sufficient to extract a universal, embodiment-agnostic motion representation that can be flexibly retargeted across embodiments. We further assemble a large-scale multimodal dataset combining multi-camera captures, monocular videos, and public HOI corpora for training and evaluating embodied models.
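The pipeline stages named in the abstract (reconstruction, RL-based refinement, retargeting) can be sketched as a minimal skeleton. This is an illustrative outline only: all function names, types, and signatures below are hypothetical placeholders, not the actual Robowheel API, and the stage bodies are stubs standing in for the real reconstruction, optimization, and retargeting components.

```python
from dataclasses import dataclass

# Hypothetical sketch of the data-engine stages described in the abstract.
# All names and signatures are illustrative assumptions, not the real API.

@dataclass
class HOITrajectory:
    hand_poses: list     # per-frame hand poses (placeholder)
    object_poses: list   # per-frame object poses (placeholder)

def reconstruct_hoi(video_frames):
    """Stage 1: HOI reconstruction from monocular RGB(D) frames (stub)."""
    n = len(video_frames)
    return HOITrajectory(hand_poses=[None] * n, object_poses=[None] * n)

def rl_refine(traj):
    """Stage 2: RL optimizer refining hand-object relative poses under
    contact and penetration constraints (stub: returns input unchanged)."""
    return traj

def retarget(traj, embodiment):
    """Stage 3: map the embodiment-agnostic trajectory to actions for a
    specific embodiment (arm, dexterous hand, humanoid)."""
    return [(embodiment, h, o)
            for h, o in zip(traj.hand_poses, traj.object_poses)]

def generate_data(video_frames, embodiments):
    """End-to-end: video -> reconstruction -> refinement -> retargeting."""
    traj = rl_refine(reconstruct_hoi(video_frames))
    return {e: retarget(traj, e) for e in embodiments}

rollouts = generate_data(video_frames=[0, 1, 2],
                         embodiments=["arm", "dex_hand"])
```

One rollout per embodiment is produced from a single video, which is the cross-morphology property the paper emphasizes: the intermediate `HOITrajectory` is robot-agnostic, and only the final retargeting step is embodiment-specific.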