🤖 AI Summary
Contact-rich manipulation tasks are difficult to perform robustly with vision alone, yet the development of tactile-driven policies is hindered by the high cost of collecting real-world tactile data and the absence of a unified evaluation platform. To address these issues, this work proposes UniVTAC, the first unified visuo-tactile simulation platform, which supports diverse tactile sensors, enables scalable generation of controllable multimodal interaction data, and couples a dedicated visuo-tactile encoder with a benchmark suite of eight representative tasks. On the UniVTAC Benchmark, the proposed approach improves average task success rates by 17.1%, and it yields a 25% performance gain in real-robot experiments, advancing standardized research on tactile perception for robotic manipulation.
📝 Abstract
Robotic manipulation has seen rapid progress with vision-language-action (VLA) policies. However, visuo-tactile perception is critical for contact-rich manipulation: tasks such as insertion are difficult to complete robustly using vision alone. At the same time, acquiring large-scale, reliable tactile data in the physical world remains costly and challenging, and the lack of a unified evaluation platform further limits policy learning and systematic analysis. To address these challenges, we propose UniVTAC, a simulation-based visuo-tactile data synthesis platform that supports three commonly used visuo-tactile sensors and enables scalable, controllable generation of informative contact interactions. Building on this platform, we introduce the UniVTAC Encoder, a visuo-tactile encoder trained on large-scale simulation-synthesized data with purpose-designed supervisory signals, which provides tactile-centric visuo-tactile representations for downstream manipulation tasks. In addition, we present the UniVTAC Benchmark, a suite of eight representative visuo-tactile manipulation tasks for evaluating tactile-driven policies. Experimental results show that integrating the UniVTAC Encoder improves average success rates by 17.1% on the UniVTAC Benchmark, while real-world robotic experiments further demonstrate a 25% improvement in task success. Our webpage is available at https://univtac.github.io/.
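To make the intended usage pattern concrete, here is a minimal, hypothetical sketch of how a downstream policy might consume fused visuo-tactile representations of the kind the UniVTAC Encoder is described as providing. All module and variable names (`VisuoTactileEncoder`, `policy_head`, the input shapes) are illustrative assumptions for this sketch, not the released API, and the toy backbones do not reproduce the paper's architecture or supervisory signals.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a pretrained visuo-tactile encoder: it maps a
# camera image and a tactile-sensor image to a shared embedding. The real
# UniVTAC Encoder is trained on simulation-synthesized data; this module
# only illustrates the input/output interface a policy would see.
class VisuoTactileEncoder(nn.Module):
    def __init__(self, embed_dim: int = 256):
        super().__init__()
        self.vision_backbone = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        self.tactile_backbone = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        # Simple fusion: concatenate both modality embeddings, then project
        # back down to a joint embedding for the policy.
        self.fusion = nn.Linear(2 * embed_dim, embed_dim)

    def forward(self, rgb: torch.Tensor, tactile: torch.Tensor) -> torch.Tensor:
        v = self.vision_backbone(rgb)
        t = self.tactile_backbone(tactile)
        return self.fusion(torch.cat([v, t], dim=-1))

# Downstream policy head consuming the fused embedding and predicting an
# action, e.g. a 7-DoF end-effector command (an assumed action space).
policy_head = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 7))

encoder = VisuoTactileEncoder()
rgb = torch.randn(1, 3, 128, 128)     # dummy wrist-camera frame
tactile = torch.randn(1, 3, 64, 64)   # dummy tactile image (GelSight-style)
action = policy_head(encoder(rgb, tactile))
print(action.shape)  # torch.Size([1, 7])
```

The sketch fuses the two modalities by concatenation and a linear projection purely for illustration; the paper's encoder instead learns tactile-centric representations from large-scale simulated interactions with designed supervisory signals.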