🤖 AI Summary
Existing imitation-learning approaches depend on high-quality demonstrations tailored to specific robot embodiments, making data collection costly and limiting generalization. Handheld grippers ease data acquisition, but they typically rely on a single first-person wrist-mounted camera and therefore lack broader scene context. This work proposes MV-UMI, the first framework to jointly leverage first-person and third-person visual inputs within a handheld data-collection device, forming a multi-view perception pipeline. By pairing dual-perspective camera hardware with end-to-end imitation learning, MV-UMI mitigates the domain shift between human demonstrations and robotic execution while preserving cross-embodiment portability. On sub-tasks requiring broad scene understanding, MV-UMI improves performance by approximately 47% over baseline methods. Its effectiveness and generalization are further validated across three distinct manipulation tasks, demonstrating robustness to varying object configurations, environments, and robot platforms.
📝 Abstract
Recent advances in imitation learning have shown great promise for developing robust robot manipulation policies from demonstrations. However, this promise is contingent on the availability of diverse, high-quality datasets, which are not only challenging and costly to collect but are often constrained to a specific robot embodiment. Portable handheld grippers have recently emerged as intuitive and scalable alternatives to traditional robotic teleoperation for data collection. However, their sole reliance on first-person wrist-mounted cameras often limits their ability to capture sufficient scene context. In this paper, we present MV-UMI (Multi-View Universal Manipulation Interface), a framework that integrates a third-person perspective with the egocentric camera to overcome this limitation. This integration mitigates domain shift between human demonstration and robot deployment while preserving the cross-embodiment advantages of handheld data-collection devices. Our experimental results, including an ablation study, demonstrate that MV-UMI improves performance on sub-tasks requiring broad scene understanding by approximately 47% across three tasks, confirming that our approach expands the range of manipulation tasks learnable with handheld gripper systems without compromising their inherent cross-embodiment advantages.
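Neither the summary nor the abstract specifies the network design, but the core idea of fusing an egocentric wrist view with a third-person scene view can be illustrated with a minimal two-stream policy sketch. Everything below is an illustrative assumption rather than the paper's architecture: the `MultiViewPolicy` class, the ResNet-18 backbones, the late fusion by concatenation, and the 7-dimensional action head are all hypothetical.

```python
# Minimal sketch of a two-stream (multi-view) observation encoder for
# behavior cloning. Module names, feature sizes, and the MLP action head
# are illustrative assumptions, not the MV-UMI implementation.
import torch
import torch.nn as nn
from torchvision.models import resnet18


class MultiViewPolicy(nn.Module):
    """Fuses a wrist-mounted (egocentric) view with a third-person scene view."""

    def __init__(self, action_dim: int = 7):
        super().__init__()
        # Separate visual backbones, so each stream can specialize
        # (fine manipulation detail vs. global scene layout).
        self.wrist_encoder = resnet18(weights=None)
        self.scene_encoder = resnet18(weights=None)
        feat_dim = self.wrist_encoder.fc.in_features  # 512 for ResNet-18
        self.wrist_encoder.fc = nn.Identity()
        self.scene_encoder.fc = nn.Identity()
        # Late fusion by concatenation, followed by a small action head.
        self.head = nn.Sequential(
            nn.Linear(2 * feat_dim, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, wrist_rgb: torch.Tensor, scene_rgb: torch.Tensor) -> torch.Tensor:
        z = torch.cat(
            [self.wrist_encoder(wrist_rgb), self.scene_encoder(scene_rgb)], dim=-1
        )
        return self.head(z)


if __name__ == "__main__":
    policy = MultiViewPolicy(action_dim=7)  # e.g., 6-DoF pose delta + gripper
    wrist = torch.randn(1, 3, 224, 224)  # first-person wrist-camera frame
    scene = torch.randn(1, 3, 224, 224)  # third-person scene-camera frame
    print(policy(wrist, scene).shape)    # torch.Size([1, 7])
```

In practice, the original UMI line of work pairs its visual encoders with a diffusion-policy action head and relative end-effector actions; the plain MLP above is only a stand-in to keep the sketch short.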