🤖 AI Summary
Existing imitation-learning approaches depend on high-quality demonstrations tailored to specific robot embodiments, making data collection costly and limiting generalization. Handheld grippers ease data acquisition, but they typically rely on a single first-person wrist-mounted camera and therefore lack broader scene context. This work proposes MV-UMI, the first framework to jointly leverage first-person and third-person visual inputs within a handheld data-collection device, forming a multi-view perception pipeline. By pairing dual-perspective camera hardware with end-to-end imitation learning, MV-UMI mitigates the domain shift between human demonstrations and robotic execution while preserving cross-embodiment portability. On sub-tasks requiring broad scene understanding, MV-UMI improves performance by approximately 47% over baseline methods. Its effectiveness and generalization are further validated across three distinct manipulation tasks, demonstrating robustness to varying object configurations, environments, and robot platforms.
📝 Abstract
Recent advances in imitation learning have shown great promise for developing robust robot manipulation policies from demonstrations. However, this promise is contingent on the availability of diverse, high-quality datasets, which are not only challenging and costly to collect but are often constrained to a specific robot embodiment. Portable handheld grippers have recently emerged as intuitive and scalable alternatives to traditional robotic teleoperation for data collection. However, their sole reliance on first-person wrist-mounted cameras often limits their ability to capture sufficient scene context. In this paper, we present MV-UMI (Multi-View Universal Manipulation Interface), a framework that integrates a third-person perspective with the egocentric camera to overcome this limitation. This integration mitigates domain shift between human demonstration and robot deployment while preserving the cross-embodiment advantages of handheld data-collection devices. Our experimental results, including an ablation study, demonstrate that MV-UMI improves performance on sub-tasks requiring broad scene understanding by approximately 47% across three tasks, confirming that our approach expands the range of manipulation tasks learnable with handheld gripper systems without compromising their inherent cross-embodiment advantages.
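Neither the summary nor the abstract specifies the network design, but the core idea of fusing an egocentric wrist view with a third-person scene view can be illustrated with a minimal two-stream policy sketch. Everything below is an illustrative assumption rather than the paper's architecture: the `MultiViewPolicy` class, the ResNet-18 backbones, the late fusion by concatenation, and the 7-dimensional action head are all hypothetical.

```python
# Minimal sketch of a two-stream (multi-view) observation encoder for
# behavior cloning. Module names, feature sizes, and the MLP action head
# are illustrative assumptions, not the MV-UMI implementation.
import torch
import torch.nn as nn
from torchvision.models import resnet18


class MultiViewPolicy(nn.Module):
    """Fuses a wrist-mounted (egocentric) view with a third-person scene view."""

    def __init__(self, action_dim: int = 7):
        super().__init__()
        # Separate visual backbones, so each stream can specialize
        # (fine manipulation detail vs. global scene layout).
        self.wrist_encoder = resnet18(weights=None)
        self.scene_encoder = resnet18(weights=None)
        feat_dim = self.wrist_encoder.fc.in_features  # 512 for ResNet-18
        self.wrist_encoder.fc = nn.Identity()
        self.scene_encoder.fc = nn.Identity()
        # Late fusion by concatenation, followed by a small action head.
        self.head = nn.Sequential(
            nn.Linear(2 * feat_dim, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, wrist_rgb: torch.Tensor, scene_rgb: torch.Tensor) -> torch.Tensor:
        z = torch.cat(
            [self.wrist_encoder(wrist_rgb), self.scene_encoder(scene_rgb)], dim=-1
        )
        return self.head(z)


if __name__ == "__main__":
    policy = MultiViewPolicy(action_dim=7)  # e.g., 6-DoF pose delta + gripper
    wrist = torch.randn(1, 3, 224, 224)  # first-person wrist-camera frame
    scene = torch.randn(1, 3, 224, 224)  # third-person scene-camera frame
    print(policy(wrist, scene).shape)    # torch.Size([1, 7])
```

In practice, the original UMI line of work pairs its visual encoders with a diffusion-policy action head and relative end-effector actions; the plain MLP above is only a stand-in to keep the sketch short.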