🤖 AI Summary
Existing teleoperation datasets suffer from poor scalability, unsmooth trajectories, and weak cross-platform generalization, limiting their applicability to complex real-world manipulation tasks. To address these limitations, we propose FastUMI—a modular, hardware-decoupled, lightweight teleoperation framework integrating multi-view fisheye vision, end-effector state sensing, and natural-language annotations, coupled with efficient pose tracking for high-fidelity multimodal trajectory acquisition. Leveraging FastUMI, we construct FastUMI-100K: a large-scale robot manipulation dataset comprising over 100K UMI-style demonstrations collected in household scenes. FastUMI-100K significantly improves data scale, trajectory smoothness, and adaptability across diverse robotic platforms. Evaluated on multiple baseline policy-learning algorithms, FastUMI-100K achieves consistently high success rates—demonstrating its robust modeling capability for dynamic, long-horizon manipulation tasks and its practical deployment value.
📝 Abstract
Data-driven robotic manipulation learning depends on large-scale, high-quality expert demonstration datasets. However, existing datasets, which primarily rely on human-teleoperated robot collection, are limited in scalability, trajectory smoothness, and applicability across different robotic embodiments in real-world environments. In this paper, we present FastUMI-100K, a large-scale UMI-style multimodal demonstration dataset designed to overcome these limitations and meet the growing complexity of real-world manipulation tasks. Collected by FastUMI, a novel robotic system featuring a modular, hardware-decoupled mechanical design and an integrated lightweight tracking system, FastUMI-100K offers a more scalable, flexible, and adaptable solution for the diverse requirements of real-world robot demonstration data. Specifically, FastUMI-100K contains over 100K demonstration trajectories collected across representative household environments, covering 54 tasks and hundreds of object types. Our dataset integrates multimodal streams, including end-effector states, multi-view wrist-mounted fisheye images, and textual annotations. Each trajectory ranges from 120 to 500 frames in length. Experimental results demonstrate that FastUMI-100K enables high policy success rates across various baseline algorithms, confirming its robustness, adaptability, and real-world applicability for solving complex, dynamic manipulation challenges. The source code and dataset will be released at https://github.com/MrKeee/FastUMI-100K.