🤖 AI Summary
Existing egocentric (first-person) video datasets typically lack tactile signals, which makes it hard to model physically grounded hand-object interaction. To address this limitation, this work introduces EgoTouch, a large-scale, multi-view egocentric dataset with dense tactile supervision for bimanual manipulation. It covers 208 manipulation tasks across 1,891 episodes and synchronizes RGB video from a head-mounted camera and two wrist-mounted cameras with bimanual 3D hand poses and continuous pressure maps from wearable tactile sensors. The authors also propose TouchAnything, a baseline vision-to-touch prediction framework that takes the egocentric view as its primary input and can use whichever wrist-mounted views are available at inference time. Experiments show that adding wrist-mounted views generally improves tactile prediction over egocentric-only input, with relative gains of up to 5.0% in Contact IoU and 6.1% in Volumetric IoU.
📝 Abstract
Egocentric human video data, which captures rich human-environment interactions and can be collected at scale, has become a key driver of embodied intelligence research. However, existing egocentric datasets typically lack tactile sensing, a critical modality that provides direct cues about contact, force, and pressure in human-object interaction. Without such signals, models struggle to learn physically grounded representations of real-world interaction dynamics. While tactile sensors provide these cues, deploying high-quality tactile hardware at scale remains expensive and cumbersome. This raises a central question: can tactile feedback be inferred directly from visual observations, enabling scalable tactile supervision for egocentric video data and supporting physically grounded embodied learning? To enable research in this direction, we introduce EgoTouch, a large-scale multi-view egocentric dataset with dense tactile supervision for bimanual hand-object interaction. EgoTouch comprises 208 manipulation tasks spanning 1,891 episodes in diverse indoor and outdoor environments, with synchronized multi-view RGB (head-mounted egocentric and dual wrist-mounted cameras), bimanual 3D hand pose, and continuous pressure maps from wearable tactile sensors. Building on EgoTouch, we introduce TouchAnything, a baseline multi-view vision-to-touch prediction framework that uses the egocentric view as the primary input and flexibly leverages available wrist-mounted views at inference time. Experiments show that incorporating wrist-mounted views generally improves tactile prediction over egocentric-only input, achieving up to 5.0% relative improvement in Contact IoU and 6.1% relative improvement in Volumetric IoU. We will publicly release the dataset, code, and benchmark.
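The abstract reports results as relative improvements in Contact IoU and Volumetric IoU but does not define either metric. The sketch below illustrates one common way such metrics are computed in vision-based pressure estimation. It assumes that Contact IoU is the IoU of thresholded binary contact masks and that Volumetric IoU is the ratio of summed element-wise minima to summed element-wise maxima of the two pressure maps. These definitions, the function names, the threshold, and the example values are assumptions for illustration, not the paper's implementation.

```python
from typing import Sequence

# One pressure map: rows of non-negative pressure values (hypothetical layout).
Grid = Sequence[Sequence[float]]


def contact_iou(pred: Grid, target: Grid, threshold: float = 0.0) -> float:
    """IoU of binary contact masks; a cell counts as contact if pressure > threshold.

    Assumed definition, not taken from the paper.
    """
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p_on, t_on = p > threshold, t > threshold
            intersection += int(p_on and t_on)
            union += int(p_on or t_on)
    # Two empty masks agree perfectly, so return 1.0 when there is no contact at all.
    return intersection / union if union else 1.0


def volumetric_iou(pred: Grid, target: Grid) -> float:
    """Soft IoU on pressure magnitudes: sum of minima over sum of maxima.

    Assumed definition, not taken from the paper.
    """
    overlap = 0.0
    total = 0.0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            overlap += min(p, t)
            total += max(p, t)
    return overlap / total if total else 1.0


def relative_improvement(baseline: float, improved: float) -> float:
    """Relative gain of `improved` over `baseline`, e.g. 0.05 means 5%."""
    return (improved - baseline) / baseline


# Illustrative 2x2 pressure maps (made-up values, not data from EgoTouch).
predicted = [[0.0, 2.0], [1.0, 0.0]]
measured = [[0.0, 4.0], [0.0, 3.0]]

c_iou = contact_iou(predicted, measured)
v_iou = volumetric_iou(predicted, measured)
```

Under these assumed definitions, a "relative improvement" compares the multi-view score with the egocentric-only score as a fraction of the egocentric-only score, so it is not an absolute gain in IoU points.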