OmniVTLA: Vision-Tactile-Language-Action Model with Semantic-Aligned Tactile Sensing

πŸ“… 2025-08-12
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Current vision-language-action (VLA) models generalize poorly in contact-rich robotic manipulation because they omit tactile sensing. To address this, we propose OmniVTLA, a semantically-aligned multimodal manipulation framework with three components: (1) a dual-path tactile encoder built around a semantically-aligned tactile ViT (SA-ViT); (2) ObjTac, a large-scale force-based tactile dataset pairing text, vision, and touch; and (3) unified cross-modal representation learning that integrates vision, touch, language, and action. Evaluated on real robotic platforms, our approach achieves a 96.9% success rate in gripper-based pick-and-place (21.9% above the baseline), 100% success with dexterous hands (6.2% above the baseline), smoother trajectories, and significantly shorter execution times. Our core contribution is the deep integration of semantically-aligned tactile perception into the VLA paradigm, which directly tackles the key bottlenecks of tactile data heterogeneity and scarcity.

πŸ“ Abstract
Recent vision-language-action (VLA) models build on vision-language foundations, achieve promising results, and show the potential for task generalization in robot manipulation. However, owing to the heterogeneity of tactile sensors and the difficulty of acquiring tactile data, current VLA models largely overlook tactile perception and fail in contact-rich tasks. To address this issue, this paper proposes OmniVTLA, a novel architecture that incorporates tactile sensing. Our contributions are threefold. First, OmniVTLA features a dual-path tactile encoder framework that enhances tactile perception across diverse vision-based and force-based tactile sensors by combining a pretrained vision transformer (ViT) with a semantically-aligned tactile ViT (SA-ViT). Second, we introduce ObjTac, a comprehensive force-based tactile dataset capturing textual, visual, and tactile information for 56 objects across 10 categories; with 135K tri-modal samples, ObjTac supplements existing visuo-tactile datasets. Third, leveraging this dataset, we train a semantically-aligned tactile encoder to learn a unified tactile representation, which serves as a better initialization for OmniVTLA. Real-world experiments demonstrate substantial improvements over state-of-the-art VLA baselines in pick-and-place tasks, achieving 96.9% success rates with grippers (21.9% higher than the baseline) and 100% success rates with dexterous hands (6.2% higher than the baseline). In addition, OmniVTLA significantly reduces task completion time and generates smoother trajectories through tactile sensing compared to existing VLA models.
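The abstract describes the dual-path tactile encoder only at a high level. As a rough illustration, the PyTorch sketch below combines a generic pretrained ViT with a second ViT standing in for SA-ViT and fuses their features into a single tactile embedding. The class names, the fusion MLP, and the image-only tactile input are assumptions made for this sketch, not the released OmniVTLA implementation.

```python
# Sketch of a dual-path tactile encoder (assumed structure, not the authors' code):
# one path uses a generic pretrained ViT, the other a ViT standing in for SA-ViT;
# their features are fused into one tactile embedding for a downstream VLA policy.
import torch
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights


class DualPathTactileEncoder(nn.Module):
    def __init__(self, embed_dim: int = 768, out_dim: int = 512):
        super().__init__()
        # Path 1: generic pretrained ViT (ImageNet weights as a stand-in).
        self.vit = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
        self.vit.heads = nn.Identity()  # expose the [CLS] feature instead of logits
        # Path 2: SA-ViT stand-in; in practice its weights would come from
        # semantic-alignment pretraining on tri-modal data such as ObjTac.
        self.sa_vit = vit_b_16(weights=None)
        self.sa_vit.heads = nn.Identity()
        # Simple concatenation + MLP fusion (an assumption for this sketch).
        self.fuse = nn.Sequential(
            nn.Linear(2 * embed_dim, out_dim),
            nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, tactile_image: torch.Tensor) -> torch.Tensor:
        # tactile_image: [B, 3, 224, 224] tactile reading rendered as an image
        f_generic = self.vit(tactile_image)      # [B, 768]
        f_semantic = self.sa_vit(tactile_image)  # [B, 768]
        return self.fuse(torch.cat([f_generic, f_semantic], dim=-1))  # [B, out_dim]


if __name__ == "__main__":
    enc = DualPathTactileEncoder()
    feat = enc(torch.randn(2, 3, 224, 224))
    print(feat.shape)  # torch.Size([2, 512])
```

Note that force-based sensors produce signals rather than images; how such readings are tokenized for the ViT paths is not stated in this summary, so the sketch only covers image-like tactile input.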
Problem

Research questions and friction points this paper is trying to address.

Overcoming tactile data scarcity in robot manipulation models
Integrating heterogeneous tactile sensors for enhanced perception
Improving contact-rich task performance via multimodal alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-path tactile encoder with ViT and SA-ViT
ObjTac dataset with tri-modal tactile samples
Semantically-aligned unified tactile representation learning (see the alignment sketch after this list)
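The listing does not specify how the semantic alignment of the tactile encoder is trained. A common recipe for aligning a new modality with language is a CLIP-style symmetric contrastive objective over paired samples; the sketch below shows that objective in PyTorch as one plausible illustration of what ObjTac-based alignment could look like, not the paper's exact loss.

```python
# CLIP-style contrastive alignment between tactile and text embeddings
# (an illustrative objective, not necessarily the paper's training loss).
import torch
import torch.nn.functional as F


def info_nce(tactile_emb: torch.Tensor, text_emb: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired tactile/text embeddings."""
    tactile_emb = F.normalize(tactile_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = tactile_emb @ text_emb.t() / temperature  # [B, B] similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched pairs sit on the diagonal; pull them together, push others apart.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2


if __name__ == "__main__":
    tac = torch.randn(8, 512)  # tactile encoder outputs for a batch of objects
    txt = torch.randn(8, 512)  # text encoder outputs for the same objects
    print(info_nce(tac, txt).item())
```

An encoder pretrained this way would then serve as the initialization for the tactile path of the VLA model, as the abstract describes.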
πŸ”Ž Similar Papers
No similar papers found.