AI Summary
To address the high cost, low efficiency, and robot dependency of tactile data acquisition in dexterous manipulation, this paper proposes a robot-free, teleoperation-free, portable visuotactile manipulation interface. The interface integrates a Fin Ray-based soft gripper with a high-density tactile sensor array, enabling efficient, contact-intensive, handheld visuotactile co-acquisition. Furthermore, we design a cross-modal self-supervised pretraining method to learn robust multimodal tactile representations and build an end-to-end imitation learning framework. Evaluated on seven representative contact-rich tasks, our approach achieves over 3× higher data collection efficiency than baseline methods, while significantly improving policy generalization and robustness to disturbances. Key innovations include (i) the first handheld visuotactile co-acquisition paradigm and (ii) a novel multimodal representation learning mechanism tailored for few-shot tactile understanding.
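The summary mentions an end-to-end imitation learning framework over visual and tactile inputs but does not describe its architecture. The sketch below is a minimal, hypothetical illustration of how such a visuotactile behavior-cloning policy could be structured; the class name `VisuotactilePolicy`, the encoder modules, the fusion scheme, and the action dimension are assumptions, not details from the paper.

```python
# Minimal sketch of a fused visuotactile behavior-cloning policy.
# All module names and dimensions here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisuotactilePolicy(nn.Module):
    def __init__(self, vision_encoder: nn.Module, tactile_encoder: nn.Module,
                 action_dim: int = 7, hidden: int = 256):
        super().__init__()
        self.vision_encoder = vision_encoder    # e.g., an image backbone over gripper-camera frames
        self.tactile_encoder = tactile_encoder  # e.g., a pretrained tactile-array encoder
        self.head = nn.Sequential(              # fuse both modalities and regress an action
            nn.LazyLinear(hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, image: torch.Tensor, tactile: torch.Tensor) -> torch.Tensor:
        # Concatenate per-modality features, then map to a gripper/end-effector action.
        feats = torch.cat([self.vision_encoder(image), self.tactile_encoder(tactile)], dim=-1)
        return self.head(feats)

# Hypothetical behavior-cloning step on handheld demonstrations:
# loss = F.mse_loss(policy(image, tactile), expert_action)
```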
Abstract
Tactile information plays a crucial role in enabling humans and robots to interact effectively with their environment, particularly in tasks that require an understanding of contact properties. Solving such dexterous manipulation tasks typically relies on imitation learning from demonstration datasets, which are usually collected via teleoperation systems that demand substantial time and effort. To address these challenges, we present ViTaMIn, an embodiment-free manipulation interface that seamlessly integrates visual and tactile sensing into a handheld gripper, enabling data collection without the need for teleoperation. Our design employs a compliant Fin Ray gripper with tactile sensing, allowing operators to perceive force feedback during manipulation for more intuitive operation. Additionally, we propose a multimodal representation learning strategy to obtain pretrained tactile representations, improving data efficiency and policy robustness. Experiments on seven contact-rich manipulation tasks show that ViTaMIn significantly outperforms baseline methods, demonstrating its effectiveness for complex manipulation tasks.
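The abstract does not spell out the multimodal representation learning objective. One common way to realize cross-modal self-supervised pretraining over paired visual and tactile observations is a contrastive (InfoNCE-style) alignment loss; the sketch below is a minimal illustration under that assumption. The `CrossModalPretrainer` class, the encoder modules, the projection dimension, and the temperature are hypothetical and not taken from the paper.

```python
# Minimal sketch of contrastive visual-tactile pretraining (InfoNCE-style).
# This is an assumed objective, not necessarily the one used by ViTaMIn.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalPretrainer(nn.Module):
    def __init__(self, vision_encoder: nn.Module, tactile_encoder: nn.Module, dim: int = 128):
        super().__init__()
        self.vision_encoder = vision_encoder    # encodes gripper-camera frames to feature vectors
        self.tactile_encoder = tactile_encoder  # encodes tactile-array readings to feature vectors
        self.vision_proj = nn.LazyLinear(dim)   # project both modalities into a shared space
        self.tactile_proj = nn.LazyLinear(dim)
        self.temperature = 0.07

    def forward(self, images: torch.Tensor, tactile: torch.Tensor) -> torch.Tensor:
        # Embed each modality and L2-normalize the shared-space features.
        z_v = F.normalize(self.vision_proj(self.vision_encoder(images)), dim=-1)
        z_t = F.normalize(self.tactile_proj(self.tactile_encoder(tactile)), dim=-1)
        # Symmetric InfoNCE: matching (image, tactile) pairs are positives,
        # every other pairing in the batch serves as a negative.
        logits = z_v @ z_t.t() / self.temperature
        targets = torch.arange(logits.size(0), device=logits.device)
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))
```

Under this reading, the tactile encoder pretrained on co-collected demonstrations would then be reused (frozen or fine-tuned) inside the downstream imitation learning policy, which is consistent with the stated goal of improving data efficiency and policy robustness.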