TaF-VLA: Tactile-Force Alignment in Vision-Language-Action Models for Force-aware Manipulation

📅 2026-01-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language-action (VLA) models lack the physical intuition needed to reason about force interactions in contact-rich tasks, which hinders precise force control. This work proposes a paradigm shift from tactile-visual alignment to tactile-force alignment and introduces the TaF-VLA framework, whose TaF-Adapter encoder explicitly maps high-dimensional tactile signals to physical interaction forces rather than treating tactile inputs as mere visual textures. Leveraging a custom-built automated tactile-force data collection system and the resulting multimodal TaF-Dataset, the proposed approach significantly outperforms current tactile-visual and vision-only baselines on real-world contact-intensive manipulation tasks, demonstrating the efficacy of cross-modal physical reasoning for robust force-aware control.

📝 Abstract
Vision-Language-Action (VLA) models have recently emerged as powerful generalists for robotic manipulation. However, due to their predominant reliance on visual modalities, they fundamentally lack the physical intuition needed for contact-rich tasks that demand precise force regulation and physical reasoning. Existing attempts to incorporate vision-based tactile sensing into VLA models typically treat tactile inputs as auxiliary visual textures, thereby overlooking the underlying correlation between surface deformation and interaction dynamics. To bridge this gap, we propose a paradigm shift from tactile-vision alignment to tactile-force alignment. We introduce TaF-VLA, a framework that explicitly grounds high-dimensional tactile observations in physical interaction forces. To facilitate this, we develop an automated tactile-force data acquisition device and curate the TaF-Dataset, comprising over 10 million synchronized tactile observations, 6-axis force/torque measurements, and matrix force maps. The central component of our approach is the Tactile-Force Adapter (TaF-Adapter), a tactile encoder that aligns sequential tactile observations with interaction forces by extracting discretized latent representations. This mechanism ensures that the learned representations capture history-dependent, noise-insensitive physical dynamics rather than static visual textures. Finally, we integrate this force-aligned encoder into a VLA backbone. Extensive real-world experiments demonstrate that the TaF-VLA policy significantly outperforms state-of-the-art tactile-vision-aligned and vision-only baselines on contact-rich tasks, verifying its ability to achieve robust, force-aware manipulation through cross-modal physical reasoning.
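The paper's code is not included in this listing, but the abstract outlines the mechanism clearly enough to sketch one plausible reading of it. Below is a minimal, hypothetical PyTorch sketch of a tactile-force-aligned encoder in the spirit of the TaF-Adapter: a short history of tactile frames is encoded, the aggregated latent is discretized against a learned codebook, and the quantized latent is trained to regress 6-axis force/torque. All names (TaFAdapterSketch, alignment_loss), dimensions, the codebook size, and the loss weighting are illustrative assumptions, not the authors' actual architecture or training objective.

```python
# Hypothetical sketch of tactile-force alignment, inspired by the TaF-Adapter
# described above. Module names, dimensions, and loss weights are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TaFAdapterSketch(nn.Module):
    """Encodes a history of tactile frames into discretized latents
    trained to predict 6-axis force/torque (tactile-force alignment)."""

    def __init__(self, num_codes: int = 512, latent_dim: int = 256):
        super().__init__()
        # Per-frame convolutional feature extractor for tactile images.
        self.frame_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim),
        )
        # Temporal aggregation over the frame history, so the latent can
        # capture history-dependent contact dynamics rather than one frame.
        self.temporal = nn.GRU(latent_dim, latent_dim, batch_first=True)
        # Learned codebook (VQ-style) for "discretized latent information".
        self.codebook = nn.Embedding(num_codes, latent_dim)
        # Force head grounding the latent in physical interaction forces.
        self.force_head = nn.Linear(latent_dim, 6)  # 6-axis force/torque

    def forward(self, tactile_seq: torch.Tensor):
        # tactile_seq: (B, T, 3, H, W) sequence of tactile frames.
        b, t = tactile_seq.shape[:2]
        z = self.frame_encoder(tactile_seq.flatten(0, 1)).view(b, t, -1)
        _, h = self.temporal(z)       # h: (1, B, latent_dim)
        h = h.squeeze(0)
        # Nearest-codebook quantization with a straight-through estimator.
        dists = torch.cdist(h, self.codebook.weight)  # (B, num_codes)
        q = self.codebook(dists.argmin(dim=-1))
        q_st = h + (q - h).detach()   # straight-through gradient
        return self.force_head(q_st), h, q


def alignment_loss(pred_ft, target_ft, h, q, beta: float = 0.25):
    # Force regression plus standard VQ codebook/commitment terms
    # (the relative weighting here is an assumption).
    recon = F.mse_loss(pred_ft, target_ft)
    vq = F.mse_loss(q, h.detach()) + beta * F.mse_loss(h, q.detach())
    return recon + vq


# Usage (shapes only): batch of 4 sequences of 8 tactile frames at 64x64.
# model = TaFAdapterSketch()
# ft_pred, h, q = model(torch.randn(4, 8, 3, 64, 64))
# loss = alignment_loss(ft_pred, torch.randn(4, 6), h, q)
```

Quantizing against a codebook is one plausible interpretation of the abstract's "discretized latent information": it makes the representation insensitive to per-pixel tactile noise, while the recurrence over the frame history supplies the history dependence the abstract emphasizes. In a full system, the quantized latent would then be fed as an extra modality token into the VLA backbone.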
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action models
tactile sensing
force regulation
physical reasoning
contact-rich manipulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

tactile-force alignment
vision-language-action models
force-aware manipulation
Tactile-Force Adapter
physical reasoning
Authors

Yuzhe Huang
Beihang University

Pei Lin
ShanghaiTech University

Wanlin Li
Beijing Institute for General Artificial Intelligence (BIGAI)
Robotics, Force and Tactile Sensor

Daohan Li
Beijing Institute for General Artificial Intelligence

Jiajun Li
HKUST
Computer Vision

Jiaming Jiang
ShanghaiTech University, Beijing Institute for General Artificial Intelligence

Chenxi Xiao
ShanghaiTech University
Motion Planning, Tactile, Robotics

Ziyuan Jiao
UCLA
Robotics, Task and Motion Planning, Mobile Manipulation, Robotic Manipulation