Tactile Modality Fusion for Vision-Language-Action Models

📅 2026-03-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language-action (VLA) models struggle with contact-intensive manipulation tasks because they lack tactile perception. This work proposes TacFiLM, a method that uses feature-wise linear modulation (FiLM) to integrate pretrained tactile representations with intermediate visual features in a lightweight manner during fine-tuning, avoiding complex token concatenation and extensive retraining. On insertion tasks, TacFiLM significantly improves success rate, direct-insertion performance, task-completion efficiency, and force stability. It also remains robust across both in-distribution and out-of-distribution scenarios, achieving efficient and generalizable multimodal tactile enhancement.

📝 Abstract
We propose TacFiLM, a lightweight modality-fusion approach that integrates visual-tactile signals into vision-language-action (VLA) models. While recent advances in VLA models have produced robot policies that are both generalizable and semantically grounded, these models rely mainly on vision-based perception. Vision alone, however, cannot capture the complex interaction dynamics of contact-rich manipulation, including contact forces, surface friction, compliance, and shear. Recent attempts to integrate tactile signals into VLA models often increase complexity through token concatenation or large-scale pretraining, while the heavy computational demands of behavioural models call for more lightweight fusion strategies. To address these challenges, TacFiLM takes a post-training fine-tuning approach that conditions intermediate visual features on pretrained tactile representations using feature-wise linear modulation (FiLM). Experimental results on insertion tasks demonstrate consistent improvements in success rate, direct-insertion performance, completion time, and force stability across both in-distribution and out-of-distribution tasks. Together, these results support our method as an effective way to integrate tactile signals into VLA models and improve contact-rich manipulation behaviours.
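The digest includes no code, so as a rough illustration of the FiLM conditioning described in the abstract, here is a minimal PyTorch sketch: a small head maps a pretrained tactile embedding to per-channel scale (gamma) and shift (beta) parameters, which modulate an intermediate visual feature map. This is not the authors' implementation; the class name `TactileFiLM`, the single-layer head, and all dimensions are illustrative assumptions.

```python
# Minimal sketch of FiLM-style tactile conditioning (not the authors' code).
# Assumes a pretrained tactile encoder produces a fixed-size embedding; all
# module names and dimensions below are illustrative placeholders.
import torch
import torch.nn as nn


class TactileFiLM(nn.Module):
    """Predicts per-channel scale (gamma) and shift (beta) from a tactile
    embedding and applies them to an intermediate visual feature map:
    y = gamma * x + beta (Perez et al., 2018)."""

    def __init__(self, tactile_dim: int, visual_channels: int):
        super().__init__()
        # Lightweight head: in a post-training fine-tuning setup, only a
        # small module like this would need to be trained.
        self.film = nn.Linear(tactile_dim, visual_channels * 2)

    def forward(self, visual_feats: torch.Tensor,
                tactile_emb: torch.Tensor) -> torch.Tensor:
        # visual_feats: (B, C, H, W); tactile_emb: (B, tactile_dim)
        gamma, beta = self.film(tactile_emb).chunk(2, dim=-1)  # each (B, C)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        # Feature-wise linear modulation of the visual features.
        return gamma * visual_feats + beta


# Example usage with dummy tensors.
film = TactileFiLM(tactile_dim=128, visual_channels=256)
x = torch.randn(2, 256, 14, 14)   # intermediate visual features
t = torch.randn(2, 128)           # pretrained tactile embedding
y = film(x, t)
assert y.shape == x.shape
```

Because the tactile signal enters only through per-channel affine parameters, this kind of fusion adds no extra tokens to the VLA backbone, which is consistent with the paper's emphasis on avoiding token concatenation and large-scale retraining.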
Problem

Research questions and friction points this paper is trying to address.

tactile modality
vision-language-action models
contact-rich manipulation
modality fusion
lightweight fusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

tactile fusion
vision-language-action models
FiLM
lightweight modality fusion
contact-rich manipulation