🤖 AI Summary
This work addresses the challenge of accurately identifying semantic event boundaries in long-horizon, contact-rich manipulation tasks, where visual and proprioceptive cues alone are insufficient for reliable task segmentation. To overcome this limitation, the authors propose TacUMI, a compact multimodal data acquisition system that integrates ViTac visuo-tactile sensing with force-torque and pose perception, and, for the first time, embed it within a general-purpose manipulation interface to enable tightly synchronized multimodal recording. Building on this hardware foundation, they further introduce a temporal-model-based multimodal fusion framework that automatically extracts event boundaries from human demonstrations. Evaluated on a cable assembly task, the method achieves over 90% segmentation accuracy, significantly outperforming unimodal baselines and demonstrating the critical role of multimodal perception in task decomposition.
📝 Abstract
Task decomposition is critical for understanding and learning complex long-horizon manipulation tasks. For tasks involving rich physical interaction in particular, relying solely on visual observations and robot proprioception often fails to reveal the underlying event transitions. This raises the need both for efficient collection of high-quality multi-modal data and for robust segmentation methods that decompose demonstrations into meaningful modules. Building on the handheld demonstration device Universal Manipulation Interface (UMI), we introduce TacUMI, a multi-modal data collection system that additionally integrates ViTac sensors, a force-torque sensor, and a pose tracker into a compact, robot-compatible gripper design, enabling synchronized acquisition of all these modalities during human demonstrations. We then propose a multi-modal segmentation framework that leverages temporal models to detect semantically meaningful event boundaries in sequential manipulations. Evaluation on a challenging cable mounting task shows more than 90% segmentation accuracy and a marked improvement as more modalities are added, validating TacUMI as a practical foundation for both scalable collection and segmentation of multi-modal demonstrations in contact-rich tasks.
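To make the segmentation idea concrete, here is a minimal, self-contained sketch of multi-modal event-boundary detection. This is *not* the paper's learned temporal model: it substitutes a simple change-point heuristic (per-modality change magnitudes, z-scored, averaged across modalities, smoothed, then thresholded) as a stand-in, and all stream shapes, the event time, and the threshold are hypothetical values chosen for illustration.

```python
import numpy as np

def fuse_and_segment(streams, win=5, thresh=2.0):
    """Toy boundary detector: fuse per-modality change scores and flag
    timesteps where the fused score spikes. A heuristic stand-in for a
    learned temporal model; not the paper's actual method."""
    scores = []
    for x in streams:                      # each x has shape (T, D)
        d = np.linalg.norm(np.diff(x, axis=0), axis=1)  # change magnitude
        d = (d - d.mean()) / (d.std() + 1e-8)           # z-score per modality
        scores.append(d)
    fused = np.mean(scores, axis=0)        # simple cross-modal fusion
    smooth = np.convolve(fused, np.ones(win) / win, mode="same")
    return np.flatnonzero(smooth > thresh) + 1  # boundary timesteps

# Synthetic demo: tactile and force-torque streams with a contact event
# at t = 60 (hypothetical dimensions and magnitudes).
T = 120
rng = np.random.default_rng(0)
tactile = rng.normal(0, 0.01, (T, 8))
force = rng.normal(0, 0.01, (T, 6))
tactile[60:] += 1.0   # sudden tactile shift (new contact)
force[60:] += 0.5     # matching force-torque shift
boundaries = fuse_and_segment([tactile, force])
print(boundaries)
```

In this toy setup, the detected boundaries cluster around the injected event at t = 60; the paper's contribution is that a trained temporal model over synchronized TacUMI streams recovers such boundaries from real demonstrations, where heuristics of this kind are far less reliable.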