AT-VLA: Adaptive Tactile Injection for Enhanced Feedback Reaction in Vision-Language-Action Models

📅 2026-05-08
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the difficulty that existing vision–language–action (VLA) models have in achieving both fine-grained interaction and real-time performance in contact-rich manipulation tasks, where naively adding tactile sensing during fine-tuning can degrade pre-trained capabilities. The authors propose AT-VLA, an architecture that combines an adaptive tactile injection mechanism with a dual-stream tactile-reaction design. The former dynamically decides when and where tactile information is injected into the model, so that it is used only when it contributes significantly to action generation. The latter decouples a fast tactile control stream for high-frequency physical interaction from a slow vision–language stream for low-frequency perceptual reasoning. This design limits interference with the pre-trained VLA representations while enabling closed-loop tactile responses within 40 milliseconds (0.04 s), and the authors validate it in real-world experiments on contact-rich manipulation tasks.
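As a rough illustration of the "when and where" idea, the sketch below gates tactile features at a single layer: a scalar gate score decides whether the tactile features are added to the hidden state at all. This is not the authors' implementation. The function `gated_injection`, the sigmoid gate, the additive fusion, and the `threshold` parameter are all assumptions made for illustration; the summary does not specify how AT-VLA computes its injection decision.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_injection(hidden, tactile, gate_weights, gate_bias, threshold=0.5):
    """Hypothetical per-layer gate (an assumption, not the paper's method).

    Scores the tactile features with a linear layer plus sigmoid. Below the
    threshold, the hidden state is returned untouched, so the pre-trained
    pathway is not perturbed. Otherwise the tactile features are added,
    scaled by the gate score.
    """
    logit = sum(w * t for w, t in zip(gate_weights, tactile)) + gate_bias
    score = sigmoid(logit)
    if score < threshold:
        return list(hidden), score, False
    fused = [h + score * t for h, t in zip(hidden, tactile)]
    return fused, score, True


hidden = [0.2, -0.1]
# No contact: tactile features are zero, so the gate stays closed.
no_contact = gated_injection(hidden, [0.0, 0.0], [2.0, 2.0], -2.0)
# Contact: strong tactile features open the gate and modify the hidden state.
contact = gated_injection(hidden, [1.0, 1.0], [2.0, 2.0], -2.0)
print(no_contact[2], contact[2])
```

Applying one such gate per layer would cover the "where" part of the decision, since each layer could open or close independently; whether AT-VLA works at layer granularity is not stated here.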
📝 Abstract
Vision-Language-Action (VLA) models have significantly advanced the capabilities of robotic agents in executing diverse tasks; however, they still face challenges in contact-rich manipulation scenarios that require precise physical interactions. To address this limitation, recent studies have attempted to incorporate tactile signals during downstream tasks, enabling pretrained VLAs to interpret tactile feedback. Nevertheless, introducing new modalities during finetuning that are rarely present in the pretraining stage may disrupt the pretrained capabilities of VLAs. In addition, the inherently slow inference speed of VLAs hampers real-time responsiveness and limits the effective utilization of tactile feedback for action adjustment. To overcome these challenges, we propose Adaptive Tactile Vision-Language-Action (AT-VLA), which introduces a novel Adaptive Tactile Injection mechanism. This mechanism dynamically determines the appropriate timing and locations for tactile injection, injecting tactile signals only when they contribute significantly to action generation, thereby minimizing interference with pretrained representations. Furthermore, to enable rapid and accurate tactile responses, we propose a Tactile Reaction Dual-Stream mechanism, which decouples sensory processing into a slow visual-language stream for low-frequency perceptual reasoning and a fast tactile control stream for high-frequency physical interaction understanding, achieving real-time closed-loop responses within 0.04 s. Real-world experiments thoroughly validate the effectiveness of AT-VLA in contact-rich manipulation tasks. The project page is available at: https://sites.google.com/view/at-vla.
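The slow/fast decoupling described in the abstract can be pictured as a control loop in which an expensive planner is queried only occasionally, while a cheap tactile correction runs on every tick. The sketch below is a toy under stated assumptions, not the paper's architecture: `slow_stream`, `fast_stream`, the proportional `gain`, and the `slow_every` ratio are hypothetical names and values, and each loop iteration merely stands in for one fast control tick (for example, one 0.04 s cycle).

```python
def slow_stream(tick: int) -> float:
    """Placeholder for the slow vision-language stream (assumption).

    A real system would run VLA inference here; this stub returns a
    constant nominal action.
    """
    return 1.0


def fast_stream(plan: float, force: float, gain: float = 0.5) -> float:
    """Placeholder for the fast tactile stream (assumption).

    Corrects the most recent plan using the current tactile reading.
    """
    return plan - gain * force


def run_dual_stream(forces, slow_every: int = 10):
    """Run the fast stream every tick and the slow stream every `slow_every` ticks."""
    plan = 0.0
    slow_calls = 0
    actions = []
    for tick, force in enumerate(forces):
        if tick % slow_every == 0:
            plan = slow_stream(tick)  # low-frequency refresh of the plan
            slow_calls += 1
        actions.append(fast_stream(plan, force))  # high-frequency correction
    return actions, slow_calls


# 25 ticks with a single contact event at tick 5.
forces = [0.0] * 5 + [2.0] + [0.0] * 19
actions, slow_calls = run_dual_stream(forces)
print(len(actions), slow_calls, actions[5])
```

In this toy loop the contact at tick 5 changes the action on that same tick, even though the plan was last refreshed at tick 0. That is the property the dual-stream design aims for: tactile reaction latency is bounded by the fast stream rather than by VLA inference time.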
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action models
tactile feedback
contact-rich manipulation
real-time responsiveness
modality injection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Tactile Injection
Tactile Reaction Dual-Stream
Vision-Language-Action Models
Real-time Closed-loop Control
Multimodal Fusion