AT-VLA: Adaptive Tactile Injection for Enhanced Feedback Reaction in Vision-Language-Action Models

📅 2026-05-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge that existing vision–language–action (VLA) models struggle to simultaneously achieve fine-grained interaction and real-time performance in contact-intensive manipulation tasks, while naive integration of tactile sensing often degrades pre-trained capabilities. To overcome this, the authors propose AT-VLA, a novel architecture featuring an adaptive tactile injection mechanism and a dual-stream tactile-reactive design. The former dynamically controls when and where tactile information is fused into the model, while the latter decouples high-frequency tactile feedback for low-latency control from low-frequency visual–language reasoning. This approach preserves the original VLA pre-training while enabling efficient tactile utilization and closed-loop response within 40 milliseconds, significantly improving task success rates and interaction precision in real-world experiments.
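To make the idea of controlling when tactile information is fused more concrete, here is a minimal sketch of a gated injection step. It is not the authors' implementation: the logistic gate, the `threshold` value, the weights, and the assumption that tactile features are already projected to the hidden dimension are all hypothetical choices made for illustration.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_inject(hidden, tactile, gate_weights, gate_bias, threshold=0.5):
    """Fuse tactile features into a hidden vector only when a gate opens.

    Hypothetical sketch: the gate is a logistic score over the tactile
    features, and `tactile` is assumed to match `hidden` in length.
    Returns (fused_vector, gate_value).
    """
    score = sum(w * t for w, t in zip(gate_weights, tactile)) + gate_bias
    gate = sigmoid(score)
    if gate < threshold:
        # Gate closed: the pretrained representation passes through unchanged.
        return list(hidden), gate
    # Gate open: add the tactile features, scaled by the gate value.
    return [h + gate * t for h, t in zip(hidden, tactile)], gate


hidden = [0.3, -0.2]
weights, bias = [2.0, 2.0], -2.0  # illustrative values, not learned

no_contact, g0 = gated_inject(hidden, [0.0, 0.0], weights, bias)
contact, g1 = gated_inject(hidden, [1.0, 1.0], weights, bias)
```

With no contact signal the gate stays below the threshold and the hidden vector is returned untouched, which is one simple way to limit interference with pretrained representations. With a strong contact signal the gate opens and the tactile features are added.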

📝 Abstract

Vision-Language-Action (VLA) models have significantly advanced the capabilities of robotic agents in executing diverse tasks; however, they still face challenges in contact-rich manipulation scenarios that require precise physical interactions. To address this limitation, recent studies have attempted to incorporate tactile signals during downstream tasks, enabling pretrained VLAs to interpret tactile feedback. Nevertheless, introducing new modalities during finetuning that are rarely present in the pretraining stage may disrupt the pretrained capabilities of VLAs. In addition, the inherently slow inference speed of VLAs hampers real-time responsiveness and limits the effective use of tactile feedback for action adjustment. To overcome these challenges, we propose Adaptive Tactile Vision-Language-Action (AT-VLA), which introduces a novel Adaptive Tactile Injection mechanism. This mechanism dynamically determines the appropriate timing and locations for tactile injection, incorporating tactile signals only when they significantly contribute to action generation, thereby minimizing interference with pretrained representations. Furthermore, to enable rapid and accurate tactile responses, we propose a Tactile Reaction Dual-Stream mechanism, which decouples sensory processing into a slow visual-language stream for low-frequency perceptual reasoning and a fast tactile control stream for high-frequency physical interaction understanding, achieving real-time closed-loop responses within 0.04 s. Real-world experiments thoroughly validate the effectiveness of AT-VLA in contact-rich manipulation tasks. The project page is available at: https://sites.google.com/view/at-vla.
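The dual-stream idea can be illustrated with a small scheduling sketch: a slow stream refreshes a coarse plan occasionally, while a fast stream adjusts the action on every control tick using the latest tactile reading. A 0.04 s response time corresponds to a control rate of roughly 25 Hz. Everything else here is an assumption for illustration: the slow-to-fast ratio `SLOW_EVERY_N_TICKS`, the placeholder functions `slow_stream` and `fast_stream`, and the damping rule are hypothetical and are not taken from the paper.

```python
FAST_PERIOD_S = 0.04      # from the abstract: closed-loop response within 0.04 s
SLOW_EVERY_N_TICKS = 10   # assumption: slow stream runs once per 10 fast ticks


def slow_stream(instruction: str) -> list:
    """Placeholder for the visual-language stream: returns a coarse plan."""
    return [0.1, 0.0]


def fast_stream(plan: list, contact_force: float) -> list:
    """Placeholder tactile stream: damp motion along the first axis
    in proportion to the measured contact force (clamped at zero)."""
    scale = max(0.0, 1.0 - contact_force)
    return [plan[0] * scale, plan[1]]


def run(contact_forces: list):
    """Simulate the loop tick by tick (no real-time sleeping)."""
    plan, actions, slow_calls = None, [], 0
    for tick, force in enumerate(contact_forces):
        if tick % SLOW_EVERY_N_TICKS == 0:
            plan = slow_stream("insert the peg")  # low-frequency reasoning
            slow_calls += 1
        actions.append(fast_stream(plan, force))  # high-frequency reaction
    return actions, slow_calls


# 10 ticks of free motion, then 10 ticks with contact.
actions, slow_calls = run([0.0] * 10 + [0.5] * 10)
control_rate_hz = 1.0 / FAST_PERIOD_S
```

In this sketch the action changes on the very first tick after contact appears, even though the slow stream has been called only twice over the 20 ticks. That is the property the decoupling is meant to provide.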

Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action models

tactile feedback

contact-rich manipulation

real-time responsiveness

modality injection

Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Tactile Injection

Tactile Reaction Dual-Stream

Vision-Language-Action Models

Real-time Closed-loop Control

Multimodal Fusion

🔎 Similar Papers

A Survey on Vision-Language-Action Models for Embodied AI

2024-05-23 · arXiv.org · Citations: 18
