🤖 AI Summary
Dexterous manipulation policies generalize poorly to novel environments, tasks, and robot embodiments, and the high cost and limited scale of real-world robot data collection further hinder practical deployment. To address this, the authors build DexWild-System, a low-cost, mobile, and easy-to-use device that lets a distributed team of data collectors record in-the-wild human hand interactions across many environments and objects. On top of this data, the DexWild learning framework co-trains policies on both human and robot demonstrations, substantially reducing reliance on costly robot teleoperation data. Experiments show that the resulting policies achieve a 68.5% task success rate in unseen environments, roughly 3.9× higher than training on robot demonstration data alone, and improve cross-embodiment generalization by 5.8×.
📝 Abstract
Large-scale, diverse robot datasets have emerged as a promising path toward enabling dexterous manipulation policies to generalize to novel environments, but acquiring such datasets presents many challenges. While teleoperation provides high-fidelity datasets, its high cost limits its scalability. Instead, what if people could use their own hands, just as they do in everyday life, to collect data? In DexWild, a diverse team of data collectors uses their hands to collect hours of interactions across a multitude of environments and objects. To record this data, we create DexWild-System, a low-cost, mobile, and easy-to-use device. The DexWild learning framework co-trains on both human and robot demonstrations, leading to improved performance compared to training on each dataset individually. This combination results in robust robot policies capable of generalizing to novel environments, tasks, and embodiments with minimal additional robot-specific data. Experimental results demonstrate that DexWild significantly improves performance, achieving a 68.5% success rate in unseen environments, nearly four times higher than policies trained with robot data only, and offering 5.8× better cross-embodiment generalization. Video results, codebases, and instructions are available at https://dexwild.github.io
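Since the core idea is co-training one policy on pooled human and robot demonstrations, a minimal sketch of the batch-mixing pattern may help make this concrete. Everything here is an assumption for illustration, not the paper's implementation: the dataset sizes, observation/action dimensions, and the 50/50 sampling ratio are placeholders, and the real pipeline would feed retargeted human hand trajectories rather than random tensors.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset, WeightedRandomSampler

# Stand-in datasets of (observation, action) pairs. Hypothetically, the large
# set holds retargeted in-the-wild human hand data and the small set holds
# robot teleoperation episodes; shapes and sizes are illustrative only.
human_demos = TensorDataset(torch.randn(10_000, 64), torch.randn(10_000, 16))
robot_demos = TensorDataset(torch.randn(500, 64), torch.randn(500, 16))
combined = ConcatDataset([human_demos, robot_demos])

# Upweight the small robot dataset so each batch mixes both embodiments at a
# target ratio (here an assumed 50/50), instead of being dominated by the far
# larger human dataset.
robot_fraction = 0.5
weights = torch.cat([
    torch.full((len(human_demos),), (1 - robot_fraction) / len(human_demos)),
    torch.full((len(robot_demos),), robot_fraction / len(robot_demos)),
])
sampler = WeightedRandomSampler(weights, num_samples=len(combined), replacement=True)
loader = DataLoader(combined, batch_size=256, sampler=sampler)

for obs, action in loader:
    pass  # one co-training update on a mixed human/robot batch goes here
```

The mixing ratio is the key knob in this pattern: it trades off the diversity of cheap human data against the embodiment fidelity of scarce robot data, and would need tuning per task.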