AI Summary
This work addresses the challenge of learning unified, robust, and generalizable multimodal tactile representations to enhance task performance and physical property reasoning in robotic dexterous manipulation. To this end, we propose Sparsh-X, the first self-supervised representation model that jointly encodes image, audio, motion, and pressure tactile signals, trained on a million-scale real-world contact dataset to capture complementary spatiotemporal features. Key contributions include: (1) the first unified representation framework for multimodal tactile signals; (2) a novel tactile pretraining paradigm explicitly designed for physical property perception; and (3) a joint tactile-action representation learning scheme coupled with a real-to-simulation transfer adaptation framework. Evaluated on standard benchmarks, Sparsh-X achieves a 63% improvement in policy success rate, a 90% gain in object state recovery robustness, and a 48% increase in physical property classification accuracy.
Abstract
We present Sparsh-X, the first multisensory touch representation model spanning four tactile modalities: image, audio, motion, and pressure. Trained on ~1M contact-rich interactions collected with the Digit 360 sensor, Sparsh-X captures complementary touch signals at diverse temporal and spatial scales. By leveraging self-supervised learning, Sparsh-X fuses these modalities into a unified representation that captures physical properties useful for robot manipulation tasks. We study how to effectively integrate real-world touch representations for both imitation learning and tactile adaptation of sim-trained policies, showing that Sparsh-X boosts policy success rates by 63% over an end-to-end model using tactile images and improves robustness by 90% in recovering object states from touch. Finally, we benchmark Sparsh-X's ability to infer physical properties, such as object-action identification, material-quantity estimation, and force estimation. Sparsh-X improves accuracy in characterizing physical properties by 48% compared to end-to-end approaches, demonstrating the advantages of multisensory pretraining for capturing features essential for dexterous manipulation.
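To make the multimodal-fusion idea concrete, below is a minimal, hypothetical PyTorch sketch of a tactile encoder in the spirit of Sparsh-X: one lightweight encoder per Digit 360 modality, with a small transformer that fuses the four modality tokens into a unified touch embedding. All module names, input shapes, and sizes here are illustrative assumptions, not the paper's actual architecture; the sketch only shows the fusion interface, not the self-supervised pretraining itself.

```python
# Hypothetical multimodal tactile encoder (illustrative only).
# Assumed input shapes: image (B, 3, 64, 64); audio (B, 1, T);
# motion/IMU (B, 6, T); pressure (B, 1, T).
import torch
import torch.nn as nn


class TactileFusionEncoder(nn.Module):
    def __init__(self, dim: int = 256, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        # One lightweight encoder per modality, each mapping to a dim-sized token.
        self.image_enc = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, dim, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.audio_enc = nn.Sequential(
            nn.Conv1d(1, dim, 9, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
        )
        self.motion_enc = nn.Sequential(
            nn.Conv1d(6, dim, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
        )
        self.pressure_enc = nn.Sequential(
            nn.Conv1d(1, dim, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
        )
        # A learned [CLS]-style token, refined by a small transformer over the
        # modality tokens, serves as the unified touch representation.
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, n_layers)

    def forward(self, image, audio, motion, pressure):
        tokens = torch.stack([
            self.image_enc(image),
            self.audio_enc(audio),
            self.motion_enc(motion),
            self.pressure_enc(pressure),
        ], dim=1)                                   # (B, 4, dim)
        cls = self.cls.expand(tokens.size(0), -1, -1)
        fused = self.fusion(torch.cat([cls, tokens], dim=1))
        return fused[:, 0]                          # unified touch embedding


# Usage with dummy data (stream lengths are arbitrary assumptions):
enc = TactileFusionEncoder()
z = enc(torch.randn(2, 3, 64, 64),   # tactile image
        torch.randn(2, 1, 4000),     # contact audio
        torch.randn(2, 6, 200),      # IMU motion
        torch.randn(2, 1, 200))      # pressure
print(z.shape)  # torch.Size([2, 256])
```

A single fused embedding like `z` is the kind of representation a downstream policy head or physical-property classifier would consume; in the paper's setting it would come from the pretrained Sparsh-X encoder rather than a model trained from scratch.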