🤖 AI Summary
Existing wearable tactile sensors struggle to accurately perceive spatiotemporal contact patterns (timing, location, and force) across the entire hand in unconstrained, real-world settings, and no large-scale in-the-wild dataset synchronizes first-person video, full-hand tactile sensing, and hand pose. This work introduces OpenTouch, the first in-the-wild, first-person, full-hand tactile dataset, comprising 5.1 hours of synchronized multimodal data (RGB video, high-resolution tactile signals, and hand pose) and 2,900 video clips with fine-grained text annotations, built with the first high-fidelity cross-modal synchronization and joint annotation pipeline for real-world tactile capture. On top of the dataset, the authors propose benchmarks for tactile-augmented cross-modal retrieval and classification and, combining tactile signal encoding with contrastive learning, show that touch is a compact, highly discriminative modality that strengthens vision-tactile correspondence and enables retrieval of the associated tactile state directly from in-the-wild video.
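The summary does not spell out the training objective, but the tactile-signal encoding with contrastive learning it mentions can be illustrated with a standard CLIP-style InfoNCE alignment between video and tactile embeddings. The sketch below is a minimal, hypothetical example: the projection heads, feature dimensions, and temperature are assumptions, not details from the OpenTouch paper.

```python
# Minimal sketch of CLIP-style contrastive alignment between video and tactile
# embeddings. Architectures, dimensions, and the temperature are illustrative
# assumptions, not the OpenTouch authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TactileVideoAligner(nn.Module):
    def __init__(self, video_dim=768, tactile_dim=256, embed_dim=128, temperature=0.07):
        super().__init__()
        # Project pre-extracted features from each modality into a shared space.
        self.video_proj = nn.Linear(video_dim, embed_dim)
        self.tactile_proj = nn.Linear(tactile_dim, embed_dim)
        self.log_temp = nn.Parameter(torch.tensor(float(temperature)).log())

    def forward(self, video_feats, tactile_feats):
        # L2-normalize so the dot product is a cosine similarity.
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        t = F.normalize(self.tactile_proj(tactile_feats), dim=-1)
        logits = v @ t.T / self.log_temp.exp()
        # Matched video/tactile pairs lie on the diagonal of the similarity matrix.
        targets = torch.arange(len(v), device=v.device)
        loss = 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))
        return loss


# Toy usage with random features standing in for real encoder outputs.
model = TactileVideoAligner()
video_feats = torch.randn(8, 768)    # e.g. per-clip features from a video encoder
tactile_feats = torch.randn(8, 256)  # e.g. flattened full-hand tactile frames
print(model(video_feats, tactile_feats).item())
```

In this setup, each synchronized video/tactile pair acts as a positive and all other pairs in the batch as negatives, which is one common way to obtain the kind of vision-tactile correspondence the summary describes.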
📝 Abstract
The human hand is our primary interface to the physical world, yet egocentric perception rarely captures when, where, or how forcefully the hand makes contact. Robust wearable tactile sensors are scarce, and no existing in-the-wild dataset aligns first-person video with full-hand touch. To bridge the gap between visual perception and physical interaction, we present OpenTouch, the first in-the-wild egocentric full-hand tactile dataset, containing 5.1 hours of synchronized video-touch-pose data and 2,900 curated clips with detailed text annotations. Using OpenTouch, we introduce retrieval and classification benchmarks that probe how touch grounds perception and action. We show that tactile signals provide a compact yet powerful cue for grasp understanding, strengthen cross-modal alignment, and can be reliably retrieved from in-the-wild video queries. By releasing this annotated vision-touch-pose dataset and benchmark, we aim to advance multimodal egocentric perception, embodied learning, and contact-rich robotic manipulation.
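As a companion to the retrieval benchmark described above, the sketch below shows one conventional way video-to-tactile retrieval is evaluated: rank a gallery of tactile embeddings by cosine similarity to each video query and report Recall@k. The embeddings, dimensions, and metric here are illustrative assumptions, not the paper's actual protocol or data.

```python
# Minimal sketch of video-to-tactile retrieval evaluation: cosine-similarity
# ranking plus Recall@k. Placeholder embeddings, not the OpenTouch release.
import numpy as np


def recall_at_k(video_emb, tactile_emb, k=5):
    # Normalize so the inner product is cosine similarity; row i of each matrix
    # is assumed to come from the same synchronized video/tactile clip.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = tactile_emb / np.linalg.norm(tactile_emb, axis=1, keepdims=True)
    sims = v @ t.T                            # (queries, gallery)
    topk = np.argsort(-sims, axis=1)[:, :k]   # indices of best-matching tactile clips
    correct = np.arange(len(v))[:, None]      # ground truth: the paired clip index
    return float((topk == correct).any(axis=1).mean())


rng = np.random.default_rng(0)
video_emb = rng.normal(size=(100, 128))                        # stand-in video embeddings
tactile_emb = video_emb + 0.1 * rng.normal(size=(100, 128))    # noisy paired tactile embeddings
print(f"Recall@5: {recall_at_k(video_emb, tactile_emb, k=5):.2f}")
```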