OmniVTLA: Vision-Tactile-Language-Action Model with Semantic-Aligned Tactile Sensing

πŸ“… 2025-08-12
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Current vision-language-action (VLA) models generalize poorly in contact-rich robotic manipulation because they omit tactile sensing. To address this, we propose OmniVTLA, a semantically-aligned multimodal manipulation framework with three components: (1) a dual-path tactile encoder built around a semantically-aligned tactile ViT (SA-ViT); (2) ObjTac, a large-scale force-based tactile dataset pairing text, vision, and touch; and (3) unified cross-modal representation learning that integrates vision, touch, language, and action. Evaluated on real robotic platforms, our approach achieves a 96.9% success rate in gripper-based pick-and-place (21.9% above the baseline), 100% success with dexterous hands (6.2% above the baseline), smoother trajectories, and significantly shorter execution times. Our core contribution is the deep integration of semantically-aligned tactile perception into the VLA paradigm, which directly tackles the key bottlenecks of tactile data heterogeneity and scarcity.

πŸ“ Abstract
Recent vision-language-action (VLA) models build on vision-language foundations, achieve promising results, and show the potential for task generalization in robot manipulation. However, owing to the heterogeneity of tactile sensors and the difficulty of acquiring tactile data, current VLA models largely overlook tactile perception and fail in contact-rich tasks. To address this issue, this paper proposes OmniVTLA, a novel architecture that incorporates tactile sensing. Our contributions are threefold. First, OmniVTLA features a dual-path tactile encoder framework that enhances tactile perception across diverse vision-based and force-based tactile sensors by combining a pretrained vision transformer (ViT) with a semantically-aligned tactile ViT (SA-ViT). Second, we introduce ObjTac, a comprehensive force-based tactile dataset capturing textual, visual, and tactile information for 56 objects across 10 categories; with 135K tri-modal samples, ObjTac supplements existing visuo-tactile datasets. Third, leveraging this dataset, we train a semantically-aligned tactile encoder to learn a unified tactile representation, which serves as a better initialization for OmniVTLA. Real-world experiments demonstrate substantial improvements over state-of-the-art VLA baselines in pick-and-place tasks, achieving 96.9% success rates with grippers (21.9% higher than the baseline) and 100% success rates with dexterous hands (6.2% higher than the baseline). In addition, OmniVTLA significantly reduces task completion time and generates smoother trajectories through tactile sensing compared to existing VLA models.
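The abstract describes the dual-path tactile encoder only at a high level. As a rough illustration, the PyTorch sketch below combines a generic pretrained ViT with a second ViT standing in for SA-ViT and fuses their features into a single tactile embedding. The class names, the fusion MLP, and the image-only tactile input are assumptions made for this sketch, not the released OmniVTLA implementation.

```python
# Sketch of a dual-path tactile encoder (assumed structure, not the authors' code):
# one path uses a generic pretrained ViT, the other a ViT standing in for SA-ViT;
# their features are fused into one tactile embedding for a downstream VLA policy.
import torch
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights


class DualPathTactileEncoder(nn.Module):
    def __init__(self, embed_dim: int = 768, out_dim: int = 512):
        super().__init__()
        # Path 1: generic pretrained ViT (ImageNet weights as a stand-in).
        self.vit = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
        self.vit.heads = nn.Identity()  # expose the [CLS] feature instead of logits
        # Path 2: SA-ViT stand-in; in practice its weights would come from
        # semantic-alignment pretraining on tri-modal data such as ObjTac.
        self.sa_vit = vit_b_16(weights=None)
        self.sa_vit.heads = nn.Identity()
        # Simple concatenation + MLP fusion (an assumption for this sketch).
        self.fuse = nn.Sequential(
            nn.Linear(2 * embed_dim, out_dim),
            nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, tactile_image: torch.Tensor) -> torch.Tensor:
        # tactile_image: [B, 3, 224, 224] tactile reading rendered as an image
        f_generic = self.vit(tactile_image)      # [B, 768]
        f_semantic = self.sa_vit(tactile_image)  # [B, 768]
        return self.fuse(torch.cat([f_generic, f_semantic], dim=-1))  # [B, out_dim]


if __name__ == "__main__":
    enc = DualPathTactileEncoder()
    feat = enc(torch.randn(2, 3, 224, 224))
    print(feat.shape)  # torch.Size([2, 512])
```

Note that force-based sensors produce signals rather than images; how such readings are tokenized for the ViT paths is not stated in this summary, so the sketch only covers image-like tactile input.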
Problem

Research questions and friction points this paper is trying to address.

Overcoming tactile data scarcity in robot manipulation models
Integrating heterogeneous tactile sensors for enhanced perception
Improving contact-rich task performance via multimodal alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-path tactile encoder with ViT and SA-ViT
ObjTac dataset with tri-modal tactile samples
Semantically-aligned unified tactile representation learning (see the alignment sketch after this list)
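The listing does not specify how the semantic alignment of the tactile encoder is trained. A common recipe for aligning a new modality with language is a CLIP-style symmetric contrastive objective over paired samples; the sketch below shows that objective in PyTorch as one plausible illustration of what ObjTac-based alignment could look like, not the paper's exact loss.

```python
# CLIP-style contrastive alignment between tactile and text embeddings
# (an illustrative objective, not necessarily the paper's training loss).
import torch
import torch.nn.functional as F


def info_nce(tactile_emb: torch.Tensor, text_emb: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired tactile/text embeddings."""
    tactile_emb = F.normalize(tactile_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = tactile_emb @ text_emb.t() / temperature  # [B, B] similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched pairs sit on the diagonal; pull them together, push others apart.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2


if __name__ == "__main__":
    tac = torch.randn(8, 512)  # tactile encoder outputs for a batch of objects
    txt = torch.randn(8, 512)  # text encoder outputs for the same objects
    print(info_nce(tac, txt).item())
```

An encoder pretrained this way would then serve as the initialization for the tactile path of the VLA model, as the abstract describes.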
πŸ”Ž Similar Papers
No similar papers found.