🤖 AI Summary
Current vision-language-action (VLA) models exhibit limited performance on tasks involving visual occlusion, fine manipulation, and physical contact. This work proposes a contact-aware multimodal fusion approach that integrates the tactile modality into a Transformer-based policy and introduces a contact-detection gating mechanism that activates tactile tokens only upon physical contact. This design enables efficient coordination among vision, language, and touch while suppressing irrelevant sensory interference. The proposed method substantially enhances the model's generalization in complex manipulation scenarios: it improves average success rates by 20% on constrained disassembly tasks, achieves a 60% gain on within-box grasping, and delivers 2.1× the performance of the baseline under severe visual occlusion.
📝 Abstract
Vision-Language-Action (VLA) models have demonstrated significant advantages in robotic manipulation. However, their reliance on vision and language often leads to suboptimal performance on tasks involving visual occlusion, fine-grained manipulation, and physical contact. To address these challenges, we propose TacVLA, a VLA model fine-tuned by incorporating the tactile modality into a Transformer-based policy to enhance fine-grained manipulation capabilities. Specifically, we introduce a contact-aware gating mechanism that selectively activates tactile tokens only when contact is detected, enabling adaptive multimodal fusion while avoiding irrelevant tactile interference. The fused visual, language, and tactile tokens are jointly processed within the Transformer architecture to strengthen cross-modal grounding during contact-rich interaction. Extensive experiments on constraint-locked disassembly, in-box picking, and robustness evaluations demonstrate that our model outperforms baselines, improving success rates by an average of 20% on disassembly and 60% on in-box picking, and achieving a 2.1× improvement in scenarios with visual occlusion. Videos are available at https://sites.google.com/view/tacvla and code will be released.
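The abstract's central idea, gating tactile tokens on detected contact before fusing them with vision and language tokens, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `gated_fuse`, the token shapes, and the use of a simple probability threshold as the contact detector are all assumptions for clarity.

```python
import numpy as np

def gated_fuse(vision_tokens, language_tokens, tactile_tokens,
               contact_prob, threshold=0.5):
    """Contact-aware gating sketch (hypothetical, not the paper's code).

    Tactile tokens are appended to the multimodal token sequence only
    when the contact detector fires; otherwise they are dropped so that
    contact-free tactile noise cannot interfere with the policy.

    Args:
        vision_tokens:   (Nv, D) array of visual tokens.
        language_tokens: (Nl, D) array of language tokens.
        tactile_tokens:  (Nt, D) array of tactile tokens.
        contact_prob:    scalar in [0, 1] from a contact detector.
        threshold:       gate opens when contact_prob exceeds this.
    Returns:
        (Nv+Nl[+Nt], D) fused token sequence for the Transformer policy.
    """
    fused = [vision_tokens, language_tokens]
    if contact_prob > threshold:          # gate: tactile tokens only on contact
        fused.append(tactile_tokens)
    return np.concatenate(fused, axis=0)
```

In a real policy the gate would typically be learned end-to-end (e.g. a soft mask on tactile token embeddings) rather than a hard threshold, but the effect is the same: the Transformer attends over tactile tokens only during contact-rich phases of the task.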